ABSTRACT


The problem of representing speech signals in a format which
will facilitate automatic speech transcription has been investigated. The
method of representation selected for experimental study involves the
transformation of speech signals into sequences of periodically sampled
outputs of speech parameter extractors, i.e., devices designed to extract
clues from speech signals which will serve to identify the language element
being uttered. Automatic extractors have been constructed and data have
been collected to ascertain the degree to which speech sounds can be
identified properly, using several parameters reflecting the location of
formants and spectral shape information.


Methods  of  completing  the  transformation,  or  transcription,  of 
speech  into  sequences  of  language  elements  suitable  for  presentation  to 
a  human  reader  have  also  been  investigated.  Test  results  indicate  that 
the most easily instrumented transcription methods can be expected to
yield  readable  transcriptions  from  the  use  of  a  small  number  of  speech 
parameters. 
1. INTRODUCTION

The general problem with which this study has been concerned is
that of determining efficient methods of transforming speech signals into
sequences of language elements suitable for presentation either to a human
or to a machine. In the case of a human recipient, the transformed speech
should convey the same information as would be possible through the use of
a human stenographer and typist. Thus, a machine designed to implement
the speech transformation methods might be called a "phonetic typewriter".*
If the set of language elements into which the speech signals are transformed
consists of a phonetic alphabet, then such a machine could be designated
more accurately as an automatic speech transcriber.

An automatic speech transcription capability is applicable to
essentially any communications problem involving (a) human speech as an
information source or relay, (b) a temporary or permanent storage
requirement, and (c) a need for rapid human assimilation of the information. A
person can read printed matter at the rate of many hundreds of words per
minute; however, a speaker generates information at a considerably lower
rate. To achieve the higher rate of assimilation, speech must be converted
by some means to printed English. All current methods of conversion from
speech to printed text involve either at least one additional person or a
considerable delay or both. An automatic speech transcription device would
replace the extra individual as well as eliminate or reduce significantly the
transcription delay.

In addition to, and perhaps more important than, its utility as a
transcriber of text, a speech transcription technique inherently carries with it
the capability for voice control of machines. Thus, for instance, instead of the
depression of keys, pedals, buttons and the like as a means of feeding
information into a computer, a speech transcriber with a word recognition unit
could be used to program as well as insert data into the computer. Thus,
in many applications involving the transfer of human-generated instructions
to machines, a speech transcriber can serve to relieve the human from the
burden of having to learn new, and usually relatively slow, methods of
communication, by allowing the use of a natural method: a spoken language.


*See [^]. Each reference in this report is indicated by a number enclosed
in brackets. The Reference List at the end of the report identifies the
numbers with full descriptions of the references.
The specific purpose of the current project under Contract No.
AF30(602)-2641 is to find an optimum format for representation of speech
signals to facilitate automatic speech transcription. It is desired that the
derived representation be optimized with respect to accuracy of
representation, storage requirements, and ease of implementation.

The derivation of such a representation requires that suitable
measurable speech signal properties, or parameters, be found which
serve to preserve the linguistic information in speech, and also serve
as suitable inputs to a language element recognizer. The extraction of
these parameters may be regarded as a transformation, or mapping,
from "speech signal space" to "parameter space". As depicted in Figure
1, this transformation, $T_1$, is to be followed by another, $T_2$, which would
complete the conversion of speech to readable form by mapping the elements
of parameter space into a space of language elements. Although the primary
purpose of this study has been to investigate the initial transformation, $T_1$,
results have also been obtained for a few methods of completing the
transcription of speech, i.e., performing $T_2$ using a phonetic alphabet as the
language element space.


The approach taken on this project has been to investigate first
the accuracy of representation of speech attainable with the simplest form
of implementation and minimum extracted speech data. Through the
systematic augmentation of extracted speech parameters and refinement of
recognition methods, the following results have been assured:

1) Speech representation accuracy will always improve as further
effort is expended, and

2) Reliable relationships between representation accuracy,
information storage requirements, and equipment simplicity will be obtained,
from which a judgment as to an optimum combination can be made.

In Section 2 of this report the theory underlying these transcription
methods is reviewed. The basis for using a phonetic alphabet, rather than
some other set of language elements, is presented along with discussions
of the problem of selecting appropriate speech parameters and methods of
processing parameter values to recognize letters in the phonetic alphabet.
The representation of speech in parameter spaces formed by two
combinations of speech parameters is discussed in Section 3. Data are
presented which indicate the storage requirements and accuracy associated
with these representations.

Specific transcription and word recognition methods are described
in Section 4. Results of experiments conducted to ascertain performance
capabilities of these methods are also included in the fourth section, for
one speech parameter space.

Conclusions  regarding  the  type  of  speech  representation  which  will 
prove  most  useful  for  preserving  the  information  content  of  speech,  and 
also  serve  as  a  convenient  means  of  implementing  automatic  transcription 
techniques  are  presented  in  Section  5. 


2. THEORETICAL BASIS FOR AUTOMATIC SPEECH TRANSCRIPTION


The transformation of speech into readable text involves two basic
steps, or subsidiary transformations. As previously noted (Figure 1) the
first involves association of a set of parameter values with each possible
speech signal. The second involves the association of language elements
with patterns of parameter values. Derivation of these two subsidiary
transformations requires that a suitable list of parameters be selected,
and a method be devised for associating patterns of these parameters with
language elements. Also, a specific set of language elements must be
selected. These three aspects of the speech transformation problem are
discussed in the following subsections.


2.1 SELECTION OF LANGUAGE ELEMENTS

Several possibilities have been given consideration as language
elements. The most frequently listed elements are words, syllables,
phonemes, and phonetic elements, or "sounds". Several investigations
have been conducted to determine the feasibility of using each of these
language elements as a basis for transcription.* From the standpoint of
ease of interpretation by a human reader, words are the most attractive
language elements. However, with these elements the problem of selecting
suitable parameters for representing speech is difficult to solve within
reasonable limits on equipment complexity and/or storage requirements.
Perhaps the basic reason for this is that the number of words required to
represent a reasonably broad class of speech signals is large. Even a
rudimentary vocabulary consists of several hundred words. This fact creates
several obstacles to the construction of an automatic transcriber using
words as the language elements. Notable among these is the difficulty of
selecting parameters which are useful for separating more than a few words.
Since words are composed of sequences of the intervals of speech
corresponding to different states of the speech source, it is clear that parameters
must be constructed in such a way as to produce different values for these
sequences. This requirement suggests that parameters should be chosen
by examining a given collection of words and selecting features of these
specific words which tend to separate them. If the vocabulary is to remain
the same, then this can produce a satisfactory result. If, however, the
vocabulary is ever augmented, or even changed by replacement, then
there is no guarantee that the selected parameters will produce a reasonable
separation of the new words. Thus, from the standpoint of either restricting
the speech which can be transformed satisfactorily or requiring major changes
in the operations involved in speech representation in parameter space, words
are unattractive as language elements.

*See, for instance, [?] for words, [16] for syllables, and [?] for phonetic
elements or phonemes.
A related, practical difficulty which arises from the use of words
as language elements is that the number of positions in parameter space
which can conceivably correspond to words is large. If, for instance,
periodic speech samples (spaced $\Delta$ seconds apart) are quantized in some
manner with $q$ possible different values, then the number of possible
positions in parameter space is $q^{T/\Delta}$, where $T$ is an indication of the
duration of a spoken word. For most words, $T/\Delta$ is greater than 10, and if
$q = 10$ (a conservative assumption) then parameter space may consist of
as many as $10^{10}$ different points. Of course, if parameters are constructed
from sequences of speech samples, then this number can be reduced
tremendously. The selection of such parameters, however, very likely cannot
be accomplished through the systematic examination of combinations of
parameters as suggested in Section 2.1, because the number of different
words and speech samples involved in a respectable vocabulary would be
too large. Letting $w$ denote the number of words in a vocabulary, and $u$
denote the number of utterances of each word that would be used as a basis
for learning the distribution of words in parameter space, suppose it is
desired that all combinations of $k$ parameters out of $n$ candidates be examined.
This would require that $\frac{T}{\Delta}\, w\, u \binom{n}{k}$ speech samples be processed. For
$T/\Delta = 10$ samples per word, $u = 10$ utterances, $n = 10$ parameters, and $k = 8$
(so that $\binom{n}{k} = 45$), 450,000 speech samples would have to be processed to
obtain an indication of the distribution of only 100 words in the spaces formed
by all combinations of the eight parameters. If 60 samples are obtained each second,
then approximately 40 hours of speech would have to be processed to obtain
the required data.
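
The two counts in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and argument names are hypothetical, and it assumes the sample count is the product of samples per word, words, utterances, and the number of k-out-of-n parameter combinations, which reproduces the 450,000 figure quoted in the text.

```python
from math import comb


def parameter_space_size(q: int, samples_per_word: int) -> int:
    """Number of distinct quantized sample sequences, q ** (T / delta)."""
    return q ** samples_per_word


def samples_to_process(samples_per_word: int, words: int, utterances: int,
                       n_candidates: int, k_chosen: int) -> int:
    """(T / delta) * w * u * C(n, k) speech samples (assumed form of the count)."""
    return (samples_per_word * words * utterances
            * comb(n_candidates, k_chosen))


if __name__ == "__main__":
    # q = 10 levels, T / delta = 10 samples per word
    print(parameter_space_size(q=10, samples_per_word=10))  # 10000000000
    # 10 samples per word, 100 words, 10 utterances, 8 of 10 parameters
    print(samples_to_process(10, 100, 10, 10, 8))           # 450000
```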


Another problem which arises from the use of words as language
elements is that speech signals inherently must be segmented by some
means into intervals of time corresponding to utterances of words. The
transitions and other characteristics of signals, including silence intervals,
apparently do not offer an unambiguous basis for performing this
segmentation. This problem alone serves to restrict the use of words as basic
language elements to the representation of words spoken in isolation. For
continuous speech, most of the effort in recent years has been applied to
the investigation of syllables or subsyllabic language elements.
If syllables are used as language elements, then some of the
difficulties associated with the use of words are ameliorated. The
correspondence between source states and syllables involves shorter
sequences of intervals during which speech signals do not change
significantly, and the number of different syllables required to represent
a wide variety of speech is somewhat smaller* than the number of words
in a comprehensive vocabulary. Unlike words, syllables need not (and
probably cannot) be defined in a way which exactly corresponds to linguistic
syllabification. One method which has been under study** for several years
employs syllables defined as patterns of parameter values corresponding to
utterances of standard, short words. In this system, the parameters
consist of presence or absence of threshold crossings at the outputs of a filter
bank, sampled at several different times. The samples are taken at times
corresponding to significant changes in the speech signal. With 8 filters
and 5 samples per syllable, the parameter space consists of $2^{40}$ possible
patterns. Very likely, only a small percentage of these patterns would
ever occur as the result of speech signals. In [6], for instance, it is
suggested that ten to fifteen different patterns arise from a given syllable,
and if 1000 syllables are required to adequately represent speech, then
approximately $10^4$ different patterns of parameter values would be used,
assuming negligible overlap between syllables. This number places the
use of syllables within the realm of practicality. The design of "Exact
Match" devices for associating patterns of parameter values with syllables
can exploit "either-or", "always present", and "never present" conditions
for each of the 40 binary parameters corresponding to a filter and sampling
instant. For any single syllable, the "never present" condition will exist
for most of the parameters, thus allowing for construction of a relay "tree",
consisting of only a few relays, for recognition of each syllable.
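
The "Exact Match" idea described above can be sketched in software as matching a 40-bit pattern against per-syllable templates in which each position is marked always present, never present, or either-or. This is a hedged illustration only: the template encoding ('1', '0', 'x'), the function names, and the two syllable templates are assumptions made for the example, not a description of the relay device in [6].

```python
from typing import Dict, Optional, Sequence

N_FILTERS = 8
N_SAMPLES = 5
N_BITS = N_FILTERS * N_SAMPLES  # 40 binary parameters, 2**40 possible patterns

# Assumed template encoding, one character per binary parameter:
#   '1' = always present, '0' = never present, 'x' = either-or.


def matches(template: str, pattern: Sequence[int]) -> bool:
    """True if every constrained position of the template agrees with the pattern."""
    if len(template) != N_BITS or len(pattern) != N_BITS:
        raise ValueError("template and pattern must both have 40 positions")
    return all(t == "x" or int(t) == bit for t, bit in zip(template, pattern))


def recognize(templates: Dict[str, str], pattern: Sequence[int]) -> Optional[str]:
    """Return the single matching syllable, or None if zero or several match."""
    hits = [name for name, t in templates.items() if matches(t, pattern)]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    # Hypothetical templates: most positions are 'never present'.
    templates = {
        "syl_A": "1x" + "0" * 38,
        "syl_B": "01" + "0" * 38,
    }
    print(recognize(templates, [1, 1] + [0] * 38))  # syl_A
    print(recognize(templates, [0, 1] + [0] * 38))  # syl_B
    print(recognize(templates, [0, 0] + [0] * 38))  # None
```

Because most positions of a template are "never present", a hardware realization only needs to test the few positions that are not, which is the economy the text attributes to the relay tree.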

As remarked above, the two primary ways in which the use of
syllables constitutes an improvement over the use of words as language
elements are (a) the number of significant changes which occur in speech
signals during intervals corresponding to language elements is reduced,
and (b) the number of different language elements needed to adequately
represent speech is reduced. These changes permit the use of a smaller
parameter space, and simplify language element recognition (1000 syllables
instead of perhaps 5000 words for a comprehensive vocabulary). To some
investigators, it appears that the use of sub-syllabic language elements
would offer even greater simplification of the speech recognition problem

*It is suggested in [6] that 1000 syllables would suffice to accurately
represent an unrestricted vocabulary.
by the same means: reduction in the size of parameter space, and
reduction in the number of language elements (i.e., the number of
alternatives to which each pattern of parameter values must be assigned).
Representation of speech with phonemes, for instance, has been the goal
of several investigations. As with syllables, the definition of phonemes
for the purpose of automatic speech transcription necessarily differs
from the linguistic definition.* For automatic speech transcription, a
phoneme consists of those patterns of parameter values which result
from utterances judged by either a human or other means to be a
distinctive speech sound. If the judgment is made by a human, then these
language elements comprise a phonetic alphabet. From the standpoint
of ease of interpretation by a human reader, a phonetic alphabet evidently
would be quite satisfactory.** Although the reader would be required to
learn the alphabet, this can be accomplished quite easily.

In  view  of  the  fact  that  words  and  syllables  are  composed  of 
sequences  of  phonetically  distinguishable  intervals  of  speech,  it  might 
be  expected  that  speech  signals  will  change  less  during  intervals  assigned 
to  symbols  in  a  phonetic  alphabet  than  during  intervals  which  would  be 
assigned  to  syllables  or  words.  Thus,  it  is  possible  that  a  smaller 
parameter  space  may  suffice  to  distinguish  between  phonetic  speech 
elements  than  is  required  for  the  longer  elements.  However,  it  has 
been  contended  that  no  matter  what  parameters  are  used,  the  variations 
in  manifestations  of  different  speech  sounds  (in  different  environments, 
from  different  speakers,  etc.)  in  parameter  space  are  so  large  that 
separation  of  these  sounds  is  not  possible.  The  question  of  feasibility 
of  separation  of  speech  sounds  (in  a  parameter  space)  raised  by  these 
contrary  points  of  view  probably  can  be  resolved  in  the  affirmative  only 
by demonstration, i.e., by developing operations which actually produce
different  outputs  corresponding  to  different  sounds. 

If such a parameter space can be found, then the use of phonetic
elements to represent speech vastly simplifies the problem of associating
patterns of parameter values with the language elements. Approximately
40 phonemes are considered sufficient to adequately represent speech.
Thus, the number of alternatives for assignment of a pattern of parameter
values is only a few dozen, compared with a thousand or more, as would
be required for adequate representation with syllables or words.

*According to [4], "a phoneme is the minimum feature of the expression
system of a spoken language by which one thing that may be said is distinguished
from any other thing which might have been said".

**It has been suggested that a phonetic alphabet facilitates reading, and has
been adopted for use in a few schools. See, for instance, [12].



In keeping with examining simple methods first, we have directed
our attention on this project to the use of phonetic elements. As will be
shown in Section 3, enough separation between some of these elements can
be achieved with a minimal parameter set to indicate that addition of other
parameters will provide essentially nonoverlapping patterns of parameter
values corresponding to different speech sounds.

Rather than dwell on the distinctions between linguistic and
operationally defined phonemes, we have somewhat arbitrarily set up a phonetic
alphabet which consists of symbols corresponding to intervals of speech
during which very little change can be detected acoustically. These symbols,
and examples of words whose normal pronunciation produces speech sounds
corresponding to these symbols, are listed in Table 1. Also indicated are
phonemes whose utterances produce the speech sounds. The phonetic
elements are labeled in an arbitrary but suggestive way which allows for
convenient print-out from the general purpose digital computer with which
transcription methods are simulated.

2.2 SELECTION OF SPEECH PARAMETERS

The problem of extracting clues from a speech signal which contain
sufficient information to identify the language elements being uttered can be
formulated and attacked in two somewhat different ways. One approach to
the problem consists of drawing up a list of features of speech which are
phonetically distinguishable by humans and which it is believed will serve
to classify speech signals into sequences of phonetic language elements.
These features generally correspond to different states of the human speech
source, i.e., articulatory states of the vocal tract; for example, the vocal
cord vibration rate, the mouth opening, and positions of the tongue and lips.
Since the correspondence between source states and the generation of phonetic
language elements is relatively well known, the representation of speech as a
sequence of these language elements can readily be solved if a means can be
devised to measure automatically the phonetic, or "distinctive"*, features
of speech.

An example of such a list of distinctive features is shown in Table 2.**
As indicated in the table, determination of the presence or absence of ten
speech features is evidently sufficient for a human to distinguish between 35
different phonemes (which could be used to represent English quite adequately).


TABLE 1 - PHONETIC ALPHABET

            SOUND                                            SOUND GROUP
Number  Designation   IPA Phoneme(s)  EXAMPLES       Group  Characteristics
------  ------------  --------------  -------------  -----  ------------------------------
   1    AW            ɔ               ALL            I      Vowels
   2    OO            u               POOL, WAIL     I      Vowels
   3    U             ʊ               PULL           I      Vowels
   4    OR            [illegible]     BIRD, MAKER    I      Vowels
   5    AH            ɑ, ɒ            FATHER, ODD    I      Vowels
   6    UH            [illegible]     SUN, SOFA      I      Vowels
   7    O             o, ou           NOTATION, GO   I      Vowels
   8    A             a, æ            ASK, SAT       I      Vowels
   9    EH            ɛ               SET            I      Vowels
  10    I             ɪ               SIT            I      Vowels
  11    EE            i, j            BEET, YOU      I      Vowels
  12    L             l               LULL           II     Liquids
  13    R             r               REAR           II     Liquids
  14    W             w               WAIL           II     Liquids
  15    M             m               [illegible]    III    Nasal consonants
  16    N             n               NOON           III    Nasal consonants
  17    NG            ŋ               SING           III    Nasal consonants
  18    B             b               BIB            IV     Voiced stop consonants
  19    D             d               DEED           IV     Voiced stop consonants
  20    G             g               GIVE           IV     Voiced stop consonants
  21    Z             z               ZONE           V      Voiced fricative consonants
  22    V             v               VALVE          V      Voiced fricative consonants
  23    [illegible]   ð               EITHER         V      Voiced fricative consonants
  24    ZH            [illegible]     VISION         V      Voiced fricative consonants
  25    T             t               TOOT           VI     Unvoiced stop consonants
  26    P             p               PEEP           VI     Unvoiced stop consonants
  27    K             k               CAKE           VI     Unvoiced stop consonants
  28    H             h               HAIL           VII    Unvoiced fricative consonants
  29    WH            hw              WHALE          VII    Unvoiced fricative consonants
  30    F             f               [illegible]    VIII   Unvoiced fricative consonants
  31    TH            θ               THIN           VIII   Unvoiced fricative consonants
  32    S             s               CEASE          IX     Unvoiced fricative consonants
  33    SH            ʃ               MISSION        IX     Unvoiced fricative consonants
  34    CH            tʃ              CHURCH         X      Affricates
  35    DJ            [illegible]     JUDGE          X      Affricates


The basic difficulty with this approach arises from the need to
develop operations which can be performed on the speech waveform to
determine the presence or absence of the specified distinctive features.
Although study of the mechanisms by which the speech source generates
language elements has yielded considerable knowledge of speech
waveform characteristics, notably energy distributions in time and frequency,
no reliable correspondence between such measurable characteristics and
the presence or absence of distinctive features (as judged by humans) has
as yet been developed. If this approach to the speech processing problem
is pursued vigorously, then major emphasis is inevitably placed on attempts
to develop better ways to determine presence or absence of the distinctive
features.

Although this approach recognizes the basic problem of representing
speech in terms of measurable parameters, it tends to deify certain
preselected parameters as those which should be used to classify language
elements. Unfortunately, mechanizations of the judgment of parameter
values (i.e., presence or absence of distinctive features) have generally
proved unsatisfactory in one way or another.

The other general approach to the speech processing problem
differs from the first primarily in the way in which parameters are selected.
First, parameters are considered to be defined only in terms of operations
performed on the speech waveform. Although considerable guidance in the
selection of suitable operations for distinguishing speech sounds is provided
by knowledge of the manner in which humans solve the problem, no
predetermined list of speech characteristics or distinctive features is drawn
up as an unalterable goal. Instead, several candidates for useful parameters
are selected not only on the basis of their possible potential for classifying
language elements, but also on the basis of ease of implementation. These
parameters are examined to determine which language elements can be
classified through their measurement. By study of combinations of
parameters it is possible to pinpoint the language element confusions which
remain to be resolved, as well as the combination of parameters which
achieves the greatest language element separation. Generally, it is
expected that a close examination of the parameter values associated with
language elements not distinguished with the initial set of parameters will
yield suggestions for other operations or parameters which will serve to
classify these elements. By systematically introducing new parameters for
the purpose of resolving these remaining language element ambiguities, it is
anticipated that an implementable set of speech parameters will be obtained
which is sufficient to classify language elements. We have pursued this
course using phonetic elements, or speech sounds, as the language elements.*

*See Table 1 in Section 2.1.
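
The study of parameter combinations described in Section 2.2 can be sketched as a small search: for each combination of candidate parameters, list the pairs of sounds that produce at least one identical pattern of (quantized) parameter values, and prefer the combination with the fewest such confusions. Everything in the sketch is an assumption for illustration, including the function names, the definition of a "confusion" as a shared pattern, and the toy data; it is not the procedure actually used on the project.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Sample = Tuple[str, Sequence[int]]  # (sound label, quantized parameter values)


def confusions(samples: List[Sample], params: Sequence[int]) -> List[Tuple[str, str]]:
    """Pairs of sounds sharing at least one pattern of the chosen parameters."""
    patterns = {}
    for sound, values in samples:
        patterns.setdefault(sound, set()).add(tuple(values[p] for p in params))
    return sorted(
        (a, b)
        for a, b in combinations(sorted(patterns), 2)
        if patterns[a] & patterns[b]
    )


def best_combination(samples: List[Sample], n_params: int, k: int) -> Tuple[int, ...]:
    """The k-parameter combination leaving the fewest confused pairs."""
    return min(combinations(range(n_params), k),
               key=lambda combo: len(confusions(samples, combo)))


if __name__ == "__main__":
    # Hypothetical data: three sounds, three candidate parameters.
    data = [
        ("EE", (0, 2, 1)), ("EE", (0, 2, 0)),
        ("AH", (1, 0, 1)), ("AH", (1, 0, 0)),
        ("OO", (0, 0, 1)), ("OO", (0, 0, 0)),
    ]
    print(confusions(data, (0,)))        # [('EE', 'OO')]
    print(confusions(data, (1,)))        # [('AH', 'OO')]
    print(best_combination(data, 3, 2))  # (0, 1)
```

The remaining confused pairs reported for the best combination indicate which sounds a newly introduced parameter would have to separate.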
It should be emphasized at the outset that from the standpoint of
attaining or improving a speech transcription capability, speech parameters
should be selected on the basis of the degree to which different combinations
of parameter values are obtained for utterances of different sounds. This
does not mean that different parameter values will result for different sounds
when parameters are considered individually. To illustrate this point,
consider the diagram in Figure 2. As depicted in the diagram, two sounds may
result in values of two parameters which are quite similar, and involve
considerable overlap between the two sounds, when either parameter is measured
alone. However, simultaneous measurement of the two parameters may
produce a non-overlapping distribution of the two sounds in the 2-dimensional
parameter space, as shown in the diagram. This rather elementary
observation (a) suggests that the systematic introduction of new parameters, as
outlined above, will produce an efficient speech representation in terms of
storage requirements, and (b) essentially rules out the selection of speech
parameters solely on the basis of separation of single parameter values
arising from different sounds.
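
The point about individual versus joint separation can be made concrete with a small synthetic example. In the sketch below, the two "sounds" are invented point sets lying on either side of the diagonal of a two-parameter plane; the data, function names, and the choice of the diagonal as the separating boundary are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (parameter 1, parameter 2)


def interval_overlap(a: List[float], b: List[float]) -> float:
    """Length of the overlap between the ranges spanned by two value lists."""
    low = max(min(a), min(b))
    high = min(max(a), max(b))
    return max(0.0, high - low)


def separated_by_diagonal(sound_a: List[Point], sound_b: List[Point]) -> bool:
    """True if all of sound_a lies above the line p2 = p1 and all of sound_b below."""
    return (all(p2 > p1 for p1, p2 in sound_a)
            and all(p2 < p1 for p1, p2 in sound_b))


# Synthetic data: each sound spans almost the same range in either parameter.
SOUND_A: List[Point] = [(float(x), x + 1.0) for x in range(10)]
SOUND_B: List[Point] = [(x + 1.0, float(x)) for x in range(10)]

if __name__ == "__main__":
    p1_a, p2_a = zip(*SOUND_A)
    p1_b, p2_b = zip(*SOUND_B)
    print(interval_overlap(list(p1_a), list(p1_b)))  # 8.0 (heavy overlap alone)
    print(interval_overlap(list(p2_a), list(p2_b)))  # 8.0 (heavy overlap alone)
    print(separated_by_diagonal(SOUND_A, SOUND_B))   # True (separable jointly)
```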


In this study we have undertaken initially to develop a set of speech
parameters which serve to distinguish primarily between voiced sounds. The
speech mechanism for voiced sounds may be thought of as an acoustic pulse
generator (the vocal cords) exciting a multiply resonant cavity (the vocal
tract, including nasal cavities). The vibration rate of the vocal cords is
commonly associated with the pitch frequency. The several resonances,
each of approximately 90 cps in bandwidth, are several times higher in
frequency than the fundamental; they vary considerably from sound to sound
and exhibit some variation from speaker to speaker. These resonances
of the vocal tract, called formants, give rise to local peaks in the energy
spectra of speech samples. The location of these peaks may be regarded
as indications of formant positions for voiced sounds. It has long been
recognized that formant characteristics serve to distinguish fairly well
between the vowel sounds, and also carry considerable information on
other voiced sounds. We have therefore chosen to use an estimate of
formant positions (location of spectrum peaks) as the initial set of
parameters. The automatic extraction of local spectrum peaks can be
accomplished with a spectrum analyzer of the type commonly used in vocoders.
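
As a rough software analogue of extracting local spectrum peaks from the outputs of a vocoder-type spectrum analyzer, the sketch below scans a list of channel outputs for local maxima and reports the center frequencies of the largest ones. The channel layout, the peak rule (greater than the lower neighbor, at least as large as the upper neighbor), and all names are assumptions for illustration; the report does not specify how its extractors pick peaks.

```python
from typing import List, Sequence


def local_peaks(channel_outputs: Sequence[float],
                center_freqs: Sequence[float],
                max_peaks: int = 3) -> List[float]:
    """Center frequencies of the strongest local maxima, in ascending order."""
    if len(channel_outputs) != len(center_freqs):
        raise ValueError("one center frequency is needed per channel")
    candidates = []
    # End channels are skipped because they have only one neighbor.
    for i in range(1, len(channel_outputs) - 1):
        if (channel_outputs[i] > channel_outputs[i - 1]
                and channel_outputs[i] >= channel_outputs[i + 1]):
            candidates.append((channel_outputs[i], center_freqs[i]))
    candidates.sort(reverse=True)  # strongest peaks first
    return sorted(freq for _, freq in candidates[:max_peaks])


if __name__ == "__main__":
    # Hypothetical 10-channel analyzer with 200-cps channel spacing.
    freqs = [200 * i for i in range(1, 11)]
    outputs = [1, 3, 8, 4, 2, 6, 3, 1, 2, 1]
    print(local_peaks(outputs, freqs, max_peaks=2))  # [600, 1200]
    print(local_peaks(outputs, freqs, max_peaks=3))  # [600, 1200, 1800]
```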


A second set of parameters has been selected to obtain information
on the shape of the energy density spectra of speech signals. Study of
"sections" of speech (i.e., energy density spectra) on an audio signal
analyzer, or observation of commutated samples of vocoder channel
outputs displayed on an oscilloscope, provides a strong indication that different
sounds give rise to significant differences in spectral shapes. Various
possibilities exist for obtaining spectral shape information. Rather than
develop a number of operations, each of which would be designed to detect
spectrum shapes associated with a few sounds, we have investigated
initially a set of parameters which is not only expected to provide useful
information on all sounds (including unvoiced sounds), but also is easily
instrumented with a minimum of adjustment.

Figure 2. Hypothetical Distribution of Two Speech
Sounds in a Two-Dimensional Parameter
Space



The two characteristics of any, possibly unknown, function of a
quantity x (say, f(x)), which have come to be regarded as perhaps the
most important for characterizing the shape of f(x), are the first moment
about the origin and the second moment about the mean. Considering the
energy density spectrum of a speech sample as a function of frequency,
S(f), these quantities can be defined for speech by


$$\mu = \frac{M_1}{M_0} = \text{spectrum mean}$$

$$\sigma = \left\{ \frac{M_2}{M_0} - \mu^2 \right\}^{1/2} = \text{spectrum spread}$$

where

$$M_0 = \int_0^\infty S(f)\,df = \text{spectrum area, or spectrum zero-th moment,}$$

$$M_1 = \int_0^\infty f\,S(f)\,df = \text{spectrum first moment,}$$

and

$$M_2 = \int_0^\infty f^2\,S(f)\,df = \text{spectrum second moment.}$$

It is expected that different values of $\mu$ and $\sigma$ will result from
utterances of different speech sounds. To obtain $\mu$ and $\sigma$ it is
sufficient to measure $M_0$, $M_1$, and $M_2$, each of which can be obtained
through linear operations on the outputs of a vocoder spectrum analyzer.
While these quantities may not be sufficient to differentiate between all
sounds, it is anticipated that their measurement will afford a significant
improvement in transcription capability over that attainable with spectral
peaks alone, for unvoiced as well as voiced sounds.
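
The step from the three moments to the spectrum mean and spread can be sketched as follows. This is an illustrative sketch only: it assumes that the channel number k = 1, 2, ... stands in for frequency and that the integrals are replaced by sums over the analyzer channel outputs; the function names are hypothetical and do not describe the equipment.

```python
import math

def spectrum_moments(channels):
    """Return (M0, M1, M2) as weighted sums over channel outputs.

    Assumption: channel number k (starting at 1) is used in place of frequency.
    """
    m0 = sum(channels)
    m1 = sum(k * x for k, x in enumerate(channels, start=1))
    m2 = sum(k * k * x for k, x in enumerate(channels, start=1))
    return m0, m1, m2

def mean_and_spread(channels):
    """Return (mu, sigma): first moment about the origin and the square
    root of the second moment about the mean, both normalized by M0."""
    m0, m1, m2 = spectrum_moments(channels)
    if m0 <= 0:
        raise ValueError("spectrum area must be positive")
    mu = m1 / m0
    variance = max(m2 / m0 - mu * mu, 0.0)  # guard against rounding below zero
    return mu, math.sqrt(variance)
```

Each of the three sums is linear in the channel outputs; only the final division and square root are nonlinear.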
A third set of parameters, which would serve to advance transcription
capability significantly, consists of measurements to ascertain
those times at which parameters of the type described above are undergoing
relatively rapid changes. These changes would serve to segment
speech into intervals corresponding to either the language elements being
used to represent speech, or some other elements from which the language
elements can be ascertained. By processing the information obtained in
such a segment of speech before rendering a decision as to which sound is
being uttered, the reliability of decisions can be improved over the
performance attainable through decisions rendered more frequently.

Assuming that a set of measurements, or operations, to be performed
on a speech signal has been formulated, consider now the behavior
of speech as represented in parameter space. This space can be formed
by considering each parameter to be a coordinate direction in (for
convenience in visualization) a rectangular coordinate space. A speech
signal, s(t), could then be represented in vector form,

$$s(t) \rightarrow \mathbf{v}(t) = [v_1(t), v_2(t), \ldots, v_n(t)],$$

where $v_i(t)$ is the time-varying result of the operation on the speech signal
defined by the i-th parameter; i.e., $\mathbf{v}(t)$ indicates the point in the
n-dimensional parameter space (formed by the parameters $v_1, v_2, \ldots, v_n$)
into which the speech signal is mapped at the time instant t. As speech is
uttered, the point $\mathbf{v}$ moves about in parameter space in some manner
corresponding to the sequence of sounds being uttered. If a good set of
parameters has been selected, then the point will lie within different regions
of parameter space during intervals corresponding to different sounds.
Solution of the speech transcription problem requires that such a set of
good parameters be found, and an easily implemented method be devised
for describing the regions in parameter space corresponding to the
different sounds.


As a first step towards simplification of equipment, it is possible
to quantize speech into intervals of time during which the point $\mathbf{v}$
changes insignificantly. Since a speech signal envelope has a bandwidth of
about 25 cps, no significant information loss is suffered if the envelope
detected outputs of a vocoder spectrum analyzer are sampled periodically, at a
rate of 50 samples per second or higher.* Further, any parameters which
are obtained through linear operations on the speech signal envelope
may also be sampled at the same rate with essentially no loss in
information. Thus, a speech signal, s(t), may be represented as a sequence
of positions in parameter space, described in vector form by

$$\mathbf{v}_i = [v_{1i}, v_{2i}, \ldots, v_{ni}], \qquad i = 1, 2, \ldots,$$

where the subscript "i" indicates the position in parameter space occupied
at the i-th sample instant. The time separation of adjacent samples, $\Delta$,
must be no greater than approximately 20 milliseconds.

*This assertion has been verified through listening tests with speech
synthesizers utilizing periodic speech samples as inputs. Speaker fidelity
is also preserved.
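
The 20 millisecond figure is consistent with the usual sampling-theorem arithmetic. As a brief worked check, assuming the envelope is band-limited to B = 25 cps (the symbols B and $f_s$ are introduced here for illustration only):

$$f_s \geq 2B = 2 \times 25 = 50 \ \text{samples per second}, \qquad \Delta = \frac{1}{f_s} \leq \frac{1}{50}\ \text{sec} = 20\ \text{milliseconds}.$$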

As a second step toward simplification of equipment, it will be
desirable to reduce the number of possible positions which can be occupied
in parameter space by quantization of the speech parameters. However,
in quantizing parameter values, considerable care must be exercised to
avoid the creation of ambiguities in parameter space which are large with
respect to the separation of different sounds (as represented in parameter
space). This problem deserves as much attention as the selection of
parameters, since quantization itself is one of the operations which defines
a parameter.


The potential of several operations constituting parameters of the
type suggested above for distinguishing between vowel sounds is examined
in Section 3. The critical importance of proper quantization of parameter
values is demonstrated with one of these parameters.
2.3 PATTERN RECOGNITION METHODS

The selection of speech parameters which take on different
values during intervals of speech corresponding to utterances of
different sounds constitutes the first, and most important, step toward
achieving an automatic transcription capability. Once a set of parameters
has been selected which serves to separate speech sounds in
parameter space, the problem of associating patterns of parameter
values with language elements arises. This problem consists of two
parts. First, the distribution of speech sounds in parameter space
must be ascertained; i.e., the patterns of parameter values which
arise from a speech sound (and preferably their relative frequency of
occurrence) must be found for all sounds in the phonetic alphabet.
Second, a satisfactory means of partitioning parameter space into
regions corresponding to the different speech sounds must be devised.
If it can be ascertained that there exist no points (i.e., patterns) in
parameter space which ever arise from utterances of more than one
sound, then this second problem is trivial. Implementation of the decision
boundaries requires only that the equivalent of a table look-up operation
be implemented.

We shall refer to these two parts of the problem of associating
points in parameter space with speech sounds as (1) finding the
distribution of sounds, and (2) establishing decision boundaries.

The complicating feature of the problem of finding the distribution
of classes of events (in this case, sounds) as represented in a parameter
space is that in most practical situations the classes are known only
through a finite set of sample events. The number of possible patterns
of parameter values usually exceeds by far the number of sample events
which can be obtained. Thus, solution of the first part of the pattern
recognition problem requires that some doctrine be applied to decide
whether any of the parameter patterns which have not occurred in the
sample events should be regarded as belonging to any of the classes, and
if so, to which classes. It must also be decided whether any of the sample
patterns arising from events belonging to one class might also ever occur
as manifestations of events belonging to another class. Since the number
of available sample events is usually small, one is faced with the necessity
for constructing some conception of the distribution of classes in
parameter space, based on incomplete information concerning the class
association of a sparse collection of sample points. If the chosen parameters
produce tight, widely separated clusters of points in parameter space
corresponding to the different classes, then a few samples should suffice
to "learn" the distribution of classes sufficiently well to avoid incorrect
associations of points with classes. However, if the parameters do not
produce such a situation, then either many samples must be obtained to
learn the nature of the distribution of classes, or, if for some reason
this is not possible, a possibly hazardous estimate of the distribution
must be made on the basis of data at hand.


Perhaps the most complete description of the distribution of
speech sounds in parameter space which one can ever hope to obtain is
the set of probability density functions of points in parameter space,
conditioned on each of the N speech sounds, $S_i$, i = 1, 2, ..., N. If parameter
space consists of discrete points (as in the case when all parameter values
are quantized), then the probability density function of $\mathbf{v}$ conditioned on
the i-th speech sound, $p_i(\mathbf{v} \mid S_i)$, is equivalent to the probability,
$P_i(\mathbf{v} \mid S_i)$, that the point $\mathbf{v}$ will occur in parameter space when the
i-th speech sound is uttered. The fact that only a restricted number of sample
patterns may be available from which the distribution of sounds in parameter
space can be inferred has motivated the development of a variety of methods of
estimating the nature, or particular characteristics, of the functions
$P_i(\mathbf{v} \mid S_i)$, using a limited amount of data.

Once some conception of the distribution of speech sounds in
parameter space is obtained, the problem of partitioning the space into
non-overlapping regions corresponding to speech sounds can be attacked.
A method of establishing decision boundaries which has come to be
regarded as an optimum method consists of calculating the likelihood that
a given point, $\mathbf{v}$, has arisen from the i-th sound, i = 1, 2, ..., N, and
choosing the sound for which this quantity is highest. If the a priori
probability of occurrence of a speech sound is assumed to be the same
for all sounds, then this Maximum Likelihood method is equivalent to
the establishment of decision boundaries according to the following rule:

If $P_j(\mathbf{v} \mid S_j) \geq P_i(\mathbf{v} \mid S_i)$ for i = 1, 2, ..., N, then associate
$\mathbf{v}$ with the j-th sound.

Thus, the maximum likelihood method requires only the comparison of
values of the functions $\{P_i(\mathbf{v} \mid S_i)\}$, which have already been established
as goals for the description of speech sounds as represented in parameter
space.
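
The comparison called for by the rule above can be sketched for a quantized (discrete) parameter space. This is a minimal illustration under stated assumptions: the conditional probabilities are supplied as look-up tables, equal a priori probabilities are assumed as in the text, and the function name and table layout are hypothetical.

```python
def maximum_likelihood_sound(point, likelihoods):
    """Return the sound whose table assigns the highest P(point | sound).

    likelihoods: dict mapping sound -> dict mapping point -> probability.
    Returns None when no table assigns the point a nonzero probability.
    """
    best_sound, best_p = None, 0.0
    for sound, table in likelihoods.items():
        p = table.get(point, 0.0)  # unseen points are given probability zero
        if p > best_p:
            best_sound, best_p = sound, p
    return best_sound
```

Returning None for a point that no table covers leaves the handling of unmatched patterns to the caller, which is a design choice of this sketch rather than something the rule itself prescribes.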
A direct approach to the problem of estimating the functions
$\{P_i(\mathbf{v} \mid S_i)\}$ consists of constructing histograms over parameter space
from a large number of independent samples of each of the speech sounds.
If one is able to obtain enough* samples, there appears to be no better
way to proceed.
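
The histogram construction described above amounts to counting how often each quantized pattern occurs for each sound and normalizing by the number of samples of that sound. A minimal sketch, assuming labeled samples are available as (sound, pattern) pairs; the names are hypothetical:

```python
from collections import Counter, defaultdict

def estimate_likelihoods(samples):
    """Estimate P(pattern | sound) by relative frequency.

    samples: iterable of (sound, pattern) pairs, with hashable patterns.
    Returns dict mapping sound -> dict mapping pattern -> relative frequency.
    """
    counts = defaultdict(Counter)
    for sound, pattern in samples:
        counts[sound][pattern] += 1
    estimates = {}
    for sound, histogram in counts.items():
        total = sum(histogram.values())
        estimates[sound] = {p: c / total for p, c in histogram.items()}
    return estimates
```

Patterns that never occur in the samples receive no entry at all, which is exactly the situation the surrounding text warns about when samples are scarce.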


Most of the pattern recognition methods which have been developed
represent attempts to exploit some a priori notions of the functions
$\{P_i(\mathbf{v} \mid S_i)\}$ in the estimation of certain features of these functions. For
instance, it may be assumed that each of these functions is unimodal, i.e.,
possesses a single local maximum. Under this assumption the use of
multiple order linear discriminant functions may lead to good results. This
method involves the use of hyperplanes of the form

$$y_i = (\mathbf{a}_i \cdot \mathbf{v}) = \sum_{k=1}^{n} a_{ik} v_k$$

for the decision boundaries, where the coefficients,
$\mathbf{a}_i = (a_{i1}, a_{i2}, \ldots, a_{in})$, can be determined in a variety of ways. Although
several hyperplanes can be used to bound each class (speech sound) from
each of the other classes, it has rarely been suggested that any more than
a single hyperplane for each pair of classes be used, since even this number
becomes intolerably large when the number of classes is greater than a
dozen or so. The rather great reliance which has been placed on the use
of linear discriminant functions for establishing decision boundaries in
classification problems suggests that potential hazards associated with
their use have not been fully appreciated. If any of the functions
$P_i(\mathbf{v} \mid S_i)$ are not unimodal, then a hyperplane may be completely
ineffective in partitioning parameter space into regions corresponding to
different classes. Consider for instance the hypothetical distribution of
two sounds in a two-dimensional parameter space as depicted in Figure
3. Although these two sounds are non-overlapping and even tightly
clustered (in multiple modes) and widely separated, there exists no
straight line which can be drawn to completely separate the two classes.
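
The failure of a single straight line can be verified on a small constructed case. The arrangement below is hypothetical and chosen only for illustration; it is not taken from the figure. Suppose sound A occupies the points (0, 0) and (1, 1), and sound B occupies (0, 1) and (1, 0). A separating line $a_1 v_1 + a_2 v_2 + c = 0$ would require

$$c > 0, \qquad a_1 + a_2 + c > 0 \qquad \text{(sound A)},$$

$$a_2 + c < 0, \qquad a_1 + c < 0 \qquad \text{(sound B)}.$$

Adding the two conditions for sound A gives $a_1 + a_2 + 2c > 0$, while adding the two conditions for sound B gives $a_1 + a_2 + 2c < 0$. The two requirements contradict each other, so no choice of $a_1$, $a_2$, and $c$ separates the classes, even though each class is tightly clustered and the classes do not overlap.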


*An attendant problem is that an implementable, general criterion by
which the number of available samples can be judged "enough" or not
has not yet emerged from the large amount of study which has been
directed to the question over the years.

Another popular method consists of using a single linear form
for each class. This method consists of correlating the pattern, $\mathbf{v}$,
with a single representative, $\mathbf{r}_i$, of the i-th class, and choosing the
class for which the correlation is highest, after normalization:

Choose the j-th class if

$$\frac{\mathbf{r}_j \cdot \mathbf{v}}{|\mathbf{r}_j|} \geq \frac{\mathbf{r}_i \cdot \mathbf{v}}{|\mathbf{r}_i|}$$

for all i. Again, a variety of methods can be applied to determine
suitable class representatives. The most common choice is the
sample mean vector of each class. This method provides the same
decision boundaries as would be obtained with the Maximum Likelihood
method, if all the classes have symmetric Gaussian distributions, with
equal variances in the parameter space. If the classes do not possess
such distributions, then this method may or may not produce good results.
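
The correlation rule with sample mean vectors as class representatives can be sketched as follows. This is a hedged illustration: the normalization by the length of the representative is an assumption about the form of the rule, and the function names and data layout are hypothetical.

```python
import math

def class_means(samples):
    """Return the sample mean vector of each class.

    samples: dict mapping class label -> list of equal-length vectors.
    """
    means = {}
    for label, vectors in samples.items():
        n = len(vectors)
        means[label] = [sum(column) / n for column in zip(*vectors)]
    return means

def correlate_classify(v, representatives):
    """Choose the class whose representative has the highest normalized
    correlation (dot product divided by the representative's length) with v."""
    def score(r):
        norm = math.sqrt(sum(x * x for x in r))
        if norm == 0.0:
            return float("-inf")  # a zero representative can never win
        return sum(a * b for a, b in zip(r, v)) / norm
    return max(representatives, key=lambda label: score(representatives[label]))
```

For example, `correlate_classify(pattern, class_means(training_samples))` associates a pattern with one of the classes found in a (hypothetical) `training_samples` dictionary.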


Correlation with a stored reference constitutes an example of
the treatment of pattern recognition as a two-step problem wherein
(1) a representative pattern is selected for each class, and (2) sample
patterns are associated with classes according to which representative
is "closest" to the sample, where the "distance" between two vectors
is defined in some way. The attributes of a variety of methods of
measuring distance have been studied extensively*, with the result
that as constraints are removed from the form to which the measure
of distance is limited, better recognition capability is achieved.


A related approach** consists of establishing nonlinear decision
boundaries in parameter space, where coefficients of second, third, and
higher powers of parameter values (in contrast with coefficients of linear
terms) are selected as a basis for fitting complicated shapes around
regions associated with different classes. Again, a variety of criteria
can be applied to select values for the coefficients. This approach provides
much greater potential for separating multi-modal distributions than linear
methods, but as a general rule, more samples are also required to obtain
an accurate placement of the nonlinear boundaries.


There are, of course, many different pattern recognition techniques
which have been developed in the past few years, some of which offer
computational simplicity in lieu of potential accuracy, and vice versa. Although
the techniques mentioned here encompass many of the methods which have
been developed, there are many others which either exist now or which
will no doubt come along in the future. The question of which method
would be best for recognizing speech sounds can only be answered by
either (a) obtaining a large number of samples of each sound from which
a good estimate of the functions $\{P_i(\mathbf{v} \mid S_i)\}$ can be obtained, or (b)
using each method and comparing the results. In the course of this study,
we have stressed the development of techniques by which sufficient data
can be collected to estimate the functions $\{P_i(\mathbf{v} \mid S_i)\}$.

Two further points should be emphasized. First, if speech
parameters can be found which produce no overlap between speech
sounds, and if the number of different combinations of parameter values
is small, then there is nothing to be gained in using any special pattern
recognition method such as setting up a particular type of discriminant
function. It would be sufficient to compare a speech sample with a
"reference library" of patterns and associate the sample with that sound
to which the reference duplicate of the sample (if one exists) corresponds.
If an exact match does not occur between the speech sample and some
member of the reference library, then a variety of possibilities exist.
For instance, the sound corresponding to the last match could be assumed
to persist. Or, if more than one speech sample is obtained for a sound
(as is the case for most speech sounds), then the decision could be deferred.
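
The reference-library comparison, together with the first of the fallback possibilities mentioned above (letting the last matched sound persist), can be sketched as a table look-up. The function name and data layout are hypothetical, and the deferred-decision alternative is not shown.

```python
def transcribe_exact_match(patterns, library):
    """Associate each quantized pattern with a sound by exact look-up.

    patterns: sequence of hashable parameter patterns, one per speech sample.
    library: dict mapping reference pattern -> sound.
    When a pattern has no exact match, the most recent matched sound is
    assumed to persist; None is emitted until a first match occurs.
    """
    decisions = []
    last_sound = None
    for pattern in patterns:
        last_sound = library.get(pattern, last_sound)
        decisions.append(last_sound)
    return decisions
```

The look-up itself involves no arithmetic on the parameter values, which is what makes this approach attractive when the library is small.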

The implementation of such a method can be extremely simple
if the number of patterns which actually arise from speech is not
unreasonably large. Although it has been contended* that speech sounds
produce too many manifestations in a parameter space to allow this
method to be employed, no proof of this contention has been provided.
To do so it would be necessary to show that all possible parameter
spaces would produce wide variations, since the variation within each
speech sound as represented in parameter space is determined by the
parameters themselves. On the other hand, the fact that intelligible
speech is produced by speech synthesizers operating on parametric
representations of speech indicates that parameters may exist which
are relatively invariant for a single speech sound, and yet produce
different values for different sounds.


*For instance, in [6].
The other point is that if more than one decision is made during
an interval corresponding to a single speech sound, then it is not necessary
that a match be obtained for every speech sample in order to use the
Exact Match method outlined above. Thus, it may be possible to reduce
the reference library to a relatively small number of patterns of parameter
values, without sacrificing the accuracy with which speech sounds can be
identified. In the next section, results of a few experiments are presented
as a basis for an initial estimate of the degree to which a reference
library of a given size can be expected to cover (i.e., match) all possible
samples of speech sounds.


3. SPEECH REPRESENTATION IN PARAMETER SPACE

As discussed in Section 2, it has been anticipated that positions
of local maxima in sample speech energy density spectra will provide
sufficient information to distinguish fairly well between the vowel sounds,
and will also serve as useful clues for identifying other voiced sounds.
Also, it is highly likely that vowel sounds cannot be recognized adequately
if some indication of formant positions is not available. Therefore, we
have considered spectral peak patterns to constitute a minimum parameter
set for voiced sounds. The ways in which these estimates of the formant
positions and other speech parameters have been extracted in this
investigation are described in the first of the following two subsections. In the
second subsection, the results of an investigation to determine the
distribution of sounds in the resulting parameter spaces are reported for two
combinations of parameters.

3.1 SPEECH PARAMETER EXTRACTION

Although intuitive conceptions of speech parameters can be described
in terms that are readily accepted and understood by everyone, the problem
of extracting numerical values of speech parameters in a way which will be
deemed satisfactory by even a few people is still a difficult one to solve.
For instance, to ascertain whether a particular method of extracting, i.e.,
estimating, formant positions is satisfactory or not, it can be contended that
measurements of the vocal cavities in the speech source must be recorded
simultaneously and compared with the numerical values obtained for the
estimates of formant positions. However, as pointed out in Section 2.2, it
is not absolutely necessary that the question of how well a parameter is
being measured ever be raised. If one adopts the point of view that a
parameter can be defined precisely only in terms of the operations actually
performed on the speech waveform to obtain parameter values, then perforce
the parameter is always extracted properly. The appropriate question
then becomes: "What are the operations to be performed on the speech
waveform (i.e., parameters) to obtain different numerical values for the
different speech sounds?"

To answer this question, the quantities listed in Table 3 have been
chosen as initial candidates for suitable parameters to facilitate speech
transcription. In the course of this study, methods of extraction have been
devised and applied for each of these quantities. The specific operations
performed on the speech waveform to obtain numerical values of these
quantities are described in the following paragraphs.
TABLE 3 - LIST OF SPEECH PARAMETERS INVESTIGATED

Parameter                                                      Notation

Location of Peaks in Speech Sample Energy Density              $\mathbf{p} = (p_1, p_2, \ldots)$
Spectrum (quantized into 18 frequency channels) --
indication of formants

Ratio of outputs of high pass and bandpass filters --          V
voicing indication

Area under Speech Sample Energy Density Spectrum --            $M_0$
Speech Sample Signal Energy

First Moment of Speech Sample Energy Density Spectrum --       $M_1$
Spectrum Mean

Second Moment of Speech Sample Energy Density Spectrum --      $M_2$
with the first moment and spectrum area, a measure of
Spectrum Spread

Input Speech Envelope Amplitude                                E

Normalized Speech Envelope Amplitude                           E

Sum of Magnitudes of Forward Differences of Samples of the     $\Delta S$
above parameters -- Speech segment boundaries

The block diagram in Figure 4 indicates the major processing
steps involved in the approach being taken to speech transcription in this
study. These are extraction and periodic sampling of speech parameters
such as those listed in Table 3, further operation on and quantization of these
samples, and association of patterns of the resulting quantized parameters
with phonetic elements.

3.1.1 Description of Equipment

To obtain data as well as demonstrate the feasibility of this approach
to speech transcription, laboratory equipment has been constructed in the
Communication Sciences Laboratory for use on this and other speech
processing projects. The configuration of this equipment for the parameters
listed in Table 3 is shown in Figure 5. The three functions performed are
(1) speech signal conditioning, (2) parameter extraction, and (3) data format
conversion.

For signal conditioning, the first operation performed on a speech
signal is pre-emphasis of the high frequencies. The pre-emphasis network
serves to accentuate the relatively weak higher formants of voiced speech
signals, and the sometimes extremely low-level unvoiced speech sounds.
Although several adjustments of the pre-emphasis network were made
during the course of this study, its final characteristic consists of
approximately 6 db gain per octave above one kcps. The envelope detected
output of this network, E, was studied as a possible candidate for a normalizing
parameter for others.


After pre-emphasis, the speech signal is next passed through an
AGC network. This network constrains the output to less than 6 db
variation for a 20 db range in the input signal level. The envelope detected
output of this network, E, was also studied as a possible normalizing
factor for other parameters.

After AGC, the signal is passed to an 18-channel parallel filter
bank. The characteristics of these filters are indicated in Figure 6.


[Figure caption fragment: ... and Speech Signal Mapping]

[Figure 5. Block Diagram of Experimental Speech Processing Equipment]

[Figure 6. Litton Vocoder Spectrum Analyzer Filter Bank Response]

In the current experimental setup, the 18-channel filter bank output,
denoted $\mathbf{f} = (f_1, f_2, \ldots, f_{18})$, is supplied to the parameter extractors:
the peak picker, the zero-th, first and second moment calculators, and the
speech segment boundary indicator. The peak picker is a unit designed to locate
and identify those channels in which the energy density spectrum of the past
few milliseconds of speech possesses a local maximum. This unit operates
as indicated in Figure 7. The envelope detected output, $x_n$, of the n-th
filter (channel) is fed to two adjacent comparators (C), one of which is
preceded by a circuit which passes the greater of two inputs. One of these
two inputs is $x_n$, and the other input is a constant, K. These circuits serve
to prevent the indication of a peak in a channel unless the peak amplitude is
greater than the adjustable threshold, K. If both comparators produce outputs
indicating that the quantity $x_n$ is larger than $x_{n-1}$ and $x_{n+1}$, then a peak
is indicated at the output by routing the adjacent comparator outputs to an AND
gate corresponding to the n-th channel.
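
The logical effect of the comparators, threshold, and AND gates can be sketched in software. This is an illustrative sketch only, not a description of the circuit: it assumes that an end channel is compared with its single neighbour, and the function name is hypothetical.

```python
def pick_peaks(x, k):
    """Return the 1-based numbers of channels holding a local spectrum peak.

    x: list of envelope-detected channel outputs for one sample instant.
    k: threshold below which no peak is indicated.
    A channel is a peak when its output exceeds both neighbours and k.
    Assumption: a missing neighbour at either end imposes no condition.
    """
    peaks = []
    for n, value in enumerate(x):
        left = x[n - 1] if n > 0 else float("-inf")
        right = x[n + 1] if n + 1 < len(x) else float("-inf")
        # Comparing against max(right, k) applies the threshold and the
        # neighbour test in one step; the 'and' plays the role of the AND gate.
        if value > left and value > max(right, k):
            peaks.append(n + 1)
    return peaks
```

Raising `k` suppresses weak peaks, which corresponds to the adjustable threshold described above.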


In view of their central role in the identification of voiced sounds,
it is of interest to note the degree to which the peak patterns (as obtained
with Litton's 18-channel vocoder and peak picking unit) correspond to other
methods of extracting indications of formant positions. Although a thorough
investigation has not been undertaken as yet, a few comparisons with a
conventional method of extracting formant position estimates by hand show
that the peak picking method produces quite similar results. Specifically,
a method of measuring (i.e., estimating) formant positions which has been
employed for years involves tracing (by hand) indications of local energy
peaks as observed in spectrograms. This method can be compared with
the automatic peak picking method by simply quantizing the tracings into
the same 18 frequency channels employed in the peak picker, using the
same speech sample for each method.

The result of such a comparison is shown in Figure 8 for the word
"ONE". Although there are differences in the two estimates of formant
positions, it is not clear which of these methods provides a more accurate
indication. In order to assess accurately the quality of either method, it
would be necessary to make careful measurements on the speech source,
as well as detailed recordings of the speech source states (mouth opening,
tongue position, etc.) during the utterance. It can be concluded from an
examination of several such comparisons between the two methods of
estimating formant positions that the peak picking method tends to produce
formant indications which resemble very closely those obtained by hand.


[Figure 7. Spectrum Peak Picker]

The moment extractors implement the calculation of zero-th,
first and second moments of the speech energy density spectrum. In
terms of the filter bank outputs, $\mathbf{f} = (f_1, f_2, \ldots, f_{18})$, these parameters
can be written

$$M_\nu = \sum_{k=1}^{18} k^\nu f_k, \qquad \nu = 0, 1, 2.$$
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Approximations  to  these  quantities  have  been  obtained  through  the  use  of 
resistive  adders.  Actually,  beeause  Of  the  relatively  low  Skirt  seleOtivity 
of  the  filters  in  the  speGtrum  analyzer,  the  weightings  employed  in  the 
equipment  have  been  changed  slightly  (from  k*^)  to  compensate  for  the 
overlap  between  Ghannels.  The  adjustment  was  made  to  produce  an 
appropriate  output  of  each  moment  extractor  for  a  sinewave  input.  The  ^ 
resulting  M  extractor  produces  an  essentially  constant  output  as  a  Gon^ 
stant  amplitude  sinewave  input  is  tuned  over  the  18  channels,  and  the 
and  extractors  produce  outputs  as  indicated  in  Figure  9. 
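As a rough illustration of the moment calculation, the following sketch computes the three moments from one set of 18 filter bank outputs. It assumes the ideal weightings k^ν; the equipment itself used slightly adjusted weightings, which are not reproduced here. The function name and the sample input are hypothetical.

```python
def spectrum_moments(f):
    """Return (M0, M1, M2) for filter bank outputs f, where f[0] is channel 1.

    Assumes ideal weights k**nu; the adjusted hardware weightings are not modelled.
    """
    return tuple(
        sum((k ** nu) * fk for k, fk in enumerate(f, start=1))
        for nu in range(3)
    )


# A single spectral line in channel 5 with amplitude 2.0 (illustrative input).
outputs = [0.0] * 18
outputs[4] = 2.0
m0, m1, m2 = spectrum_moments(outputs)
```

For a single line in channel k the ratio M1/M0 returns k, which is why the first moment can serve as a crude indicator of where the spectral energy is concentrated.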

The segment boundary indicator, ΔS, is designed to detect changes
in speech signals which correspond to transitions between speech sounds.
The current method under study consists of adding the magnitudes of the
derivatives of the envelopes of the filter bank outputs:

$$ \Delta S = \sum_{k=1}^{18} \left| \frac{d f_k}{dt} \right| $$

Although comprehensive testing of transcription methods employing this
means of speech segmentation has not been completed, the boundaries
created by thresholding ΔS do provide a reasonable correspondence between
speech segments and utterances of phonetic language elements. This
segmentation is illustrated in Table 4 for the words "Two Three". The speech
segments indicated in this table have been obtained by quantizing ΔS into
8 levels, and regarding the occurrence of the second or higher levels as
transitions.
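The segmentation rule can be sketched as follows. This is only an approximation under stated assumptions: the derivative is replaced by a first difference between successive sampled frames, the 8 levels are taken to be uniformly spaced, and `full_scale` is a hypothetical calibration constant that the report does not give.

```python
def delta_s(frames, dt=1 / 60):
    """Approximate the sum over channels of |d f_k / dt| by first differences."""
    return [
        sum(abs(c - p) for p, c in zip(prev, cur)) / dt
        for prev, cur in zip(frames, frames[1:])
    ]


def quantize(value, full_scale, levels=8):
    """Uniform quantizer returning a level index 0..levels-1 (assumed spacing)."""
    step = full_scale / levels
    return min(int(value / step), levels - 1)


def transitions(ds, full_scale):
    """Mark a transition wherever the second or a higher level occurs (index >= 1)."""
    return [quantize(v, full_scale) >= 1 for v in ds]


# Two-channel toy example: the envelopes jump between the 2nd and 3rd frames.
frames = [[0, 0], [0, 0], [1, 1], [1, 1]]
marks = transitions(delta_s(frames, dt=1), full_scale=8)
```

Runs of unmarked frames between marked ones then correspond to the speech segments listed in a table such as Table 4.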


Two more parameters, a voicing indication (V) and pitch (F), are
available in the current experimental equipment, but were not used in this
study since most of the data obtained was for voiced sounds, and pitch
provides little additional speech information.

TABLE 4. SEGMENTATION OF "TWO THREE" WITH THE PARAMETER ΔS

As indicated in Figure 5 (and discussed below), each extracted
parameter is quantized to permit its representation as a
binary number in a data format conversion unit. A digital commutator
converts the binary numbers resulting for all parameters into a sequence
of 5-bit samples suitable for direct insertion into the Recomp II computer.
The pattern of parameter values resulting from a single speech sample is
fed into one computer word location (capacity 40 bits). For convenience,
the sampling interval, Δ, has been chosen equal to the time it takes to
complete one drum revolution in the computer, thus producing
60 speech samples per second. Since the computer has a capacity of
4000 words, up to approximately a minute of continuous speech can be
processed at one time.
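The storage figures can be checked with a short sketch. The field widths are those listed in Table 5; the field names, the packing order within the word, and the helper functions are assumptions made for illustration and do not describe the actual Recomp II word layout.

```python
# Field widths in bits, taken from Table 5; names and order are illustrative.
FIELD_BITS = [("peaks", 13), ("V", 1), ("E", 3), ("E_norm", 3),
              ("M0", 3), ("M1", 3), ("M2", 3)]
WORD_BITS = 40


def pack(values):
    """Pack one sample pattern into a single integer word."""
    word = 0
    for name, bits in FIELD_BITS:
        v = values[name]
        if not 0 <= v < (1 << bits):
            raise ValueError(f"{name} out of range for {bits} bits")
        word = (word << bits) | v
    return word


def unpack(word):
    """Recover the parameter values from a packed word."""
    values = {}
    for name, bits in reversed(FIELD_BITS):
        values[name] = word & ((1 << bits) - 1)
        word >>= bits
    return values


bits_used = sum(bits for _, bits in FIELD_BITS)   # 29 of the 40 available
seconds_of_speech = 4000 / 60                     # words / (samples per second)
```

With one word per sample and 60 samples per second, 4000 words hold a little over 66 seconds, which agrees with the "approximately a minute" quoted above.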


The signals produced at any point in the block diagram in Figure 5
are available for display. For instance, the (sampled) output of the peak
picking unit can be displayed on an oscilloscope as an intensity modulated
sawtooth waveform with 60 sweeps per second. This display can be
recorded to produce a representation of a segment of speech as a sequence
of "instantaneous" spectra. As illustrated in Figure 10, an easily
interpreted segment of one second of speech is economically, permanently and
conveniently stored in a single 3" by 4" print. In this recording, a spectrum
peak in a given channel is indicated by a white mark in that channel. Three
channels are represented between each adjacent pair of horizontal grid lines
in the photograph. Another parameter, the voicing indication, has been
recorded in this photograph in the two positions above the top grid line.
Until the data format conversion equipment became available
toward the end of this project, photographs of this type were used to obtain
data for the parameter space formed by spectral peaks and the voicing
indication. Before the peak picker itself was constructed, the program
outlined in Appendix I was used to simulate its operation on the Recomp II
computer. Quantized sequences of sample spectra were obtained as inputs
to this program through the courtesy of Mr. C. F. Smith of the AFCRL
Communications Laboratory.

FIGURE 10. Spectrum Peak Picker Output Representation
of the Spoken Word "ASK"

3.1.2 Parameter Quantization


As remarked in Section 2.2, quantization of parameter values
constitutes an integral part of the way in which a parameter is defined.
Unfortunately, however, time and equipment limitations have prevented
a thorough study of the effects of several different quantizations which
would be reasonable for the parameters studied. Specifically, the way
in which each parameter has been quantized for experiments reported
for this study is indicated in Table 5. The extractor for each of the two
amplitude parameters and the three spectrum moments (M_0, M_1, and M_2)
was designed to produce outputs between 0 and 6 volts, and the eight
quantizing levels were set at 0.75 volt intervals, so
that a linear quantization of these five parameters was obtained. While
this representation is quite reasonable for the two amplitude parameters,
M_0, and M_1, it is not satisfactory for M_2. Significantly better resolution
would be obtained for this latter parameter with logarithmic spacing of
quantizing levels (as indicated in Figure 9). However, since only one A/D
converter was available in time for use on this project, a compromise was
made in favor of the linear spacing. The effects of different quantization
of the spectrum moments are discussed in Section 5 in the light of
experimental results reported in the remainder of this section.
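The difference between the two spacings can be illustrated with a small sketch. The linear quantizer follows the figures given above (0 to 6 volts, eight levels, 0.75 volt steps). The logarithmic quantizer is only an assumed form: the lower limit `v_min` and the exact level placement are hypothetical, since the report defers these details to Figure 9.

```python
import math


def linear_level(v, v_max=6.0, levels=8):
    """Uniform quantizer: eight levels at 0.75 V intervals over 0..6 V."""
    step = v_max / levels
    return min(max(int(v / step), 0), levels - 1)


def log_level(v, v_min=0.1, v_max=6.0, levels=8):
    """Logarithmically spaced levels between v_min and v_max (v_min is assumed)."""
    if v <= v_min:
        return 0
    frac = math.log(v / v_min) / math.log(v_max / v_min)
    return min(int(frac * levels), levels - 1)


# Two small outputs that the linear quantizer cannot tell apart.
low, lower = 0.4, 0.2
same_linear = linear_level(low) == linear_level(lower)
distinct_log = log_level(low) != log_level(lower)
```

A parameter whose useful variation is concentrated near the bottom of its range loses most of its resolution under linear spacing, which is the compromise described above.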


TABLE 5. SPEECH PARAMETER QUANTIZATION

Symbol   Description                    Maximum Number of      Number
                                        Different Parameter    of Bits
                                        Values
------   ---------------------------    -------------------    -------
         Spectral peak pattern          2^13                   13
V        Voicing indication             2                      1
E        Input speech amplitude         8                      3
E        Normalized speech amplitude    8                      3
M_0      Spectrum area                  8                      3
M_1      Spectrum First Moment          8                      3
M_2      Spectrum Second Moment         8                      3



3.2 DISTRIBUTION OF SPEECH SOUNDS IN PARAMETER SPACE

To ascertain the distribution of speech sounds in the parameter
spaces created by combinations of the parameters listed in Table 3,
data has been collected from three speakers using the equipment
described in the preceding section. Results are reported here for the
eleven vowel sounds listed in Table 1. To obtain a representative set
of speech samples of these sounds, the following procedure has been followed
for each speaker.

The speaker was asked to read a word list consisting of eleven
words. Each word on the list was chosen so that one of the eleven vowel
sounds would be spoken during the utterance of the word, if the word were
pronounced "properly". To obtain a reasonably large number of independent
samples of sounds within a varying environment, each speaker was
presented with 3 different word lists, at 5 different times, within an interval of
several days. The three word lists are contained in Table 6. This
procedure produced a magnetic tape recording of 15 utterances of each of the
11 vowel sounds by each of the three speakers, 495 utterances in all.

The next step consisted of playing back the recorded word lists
into the speech processing equipment. This resulted in a sequence of
sample patterns of parameter values representing the speech, stored in
the Recomp computer. This sequence of patterns was then typed out, and
the intervals corresponding to the vowel sounds identified. A human observer
made the identification while listening to the original speech recording,
and using the following rough guidelines:

a) Use only those patterns which occur within intervals of
speech comprising readily identifiable sounds.

b) Use primarily those patterns which either persist over
several samples or which change slowly over an interval of
unchanging sound.

The typed representation of this sequence of patterns and a typical
assignment of patterns to a sound are illustrated in Table 7, using the
word "Neck", and the interval associated with the sound EH (ε).
Interpretations of the binary representation of the parameters are given in
Table 8. All parameters have been represented in binary form for
convenience only; octal and decimal representations are also available
with the Recomp computer.

TABLE 6 - THREE WORD LISTS EMPLOYED FOR VOWEL SOUNDS

Table 7

The remaining step in the processing of speech data to obtain
an estimate of the distribution of sounds in parameter space consists
of listing each different pattern selected by the above procedure, along
with the number of times the pattern occurred within each sound. From
such a histogram the regions in parameter space corresponding to each
sound, and the overlap between these regions, can be ascertained.
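The histogram step lends itself to a brief sketch. The data layout used here, a list of (sound, pattern) pairs with patterns written as bit strings, is an assumption chosen for illustration; the report's own listings were typed out from the Recomp computer.

```python
from collections import Counter, defaultdict


def build_histograms(labeled_samples):
    """Count how often each pattern occurred within each sound."""
    hist = defaultdict(Counter)
    for sound, pattern in labeled_samples:
        hist[sound][pattern] += 1
    return hist


def shared_patterns(hist):
    """Return the patterns that occurred within more than one sound (overlap)."""
    seen = defaultdict(set)
    for sound, counts in hist.items():
        for pattern in counts:
            seen[pattern].add(sound)
    return {p: sounds for p, sounds in seen.items() if len(sounds) > 1}


# Illustrative data only; these are not patterns from the report.
samples = [("EE", "110000"), ("EE", "110000"), ("EE", "100100"),
           ("EH", "001100"), ("EH", "100100")]
hist = build_histograms(samples)
overlap = shared_patterns(hist)
```

The set of keys in each per-sound counter is the region occupied by that sound, and `shared_patterns` lists the points at which regions overlap.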

3.2.1 Parameter Space Usage

The number of sample patterns processed to obtain a picture
of the distribution of vowel sounds, and the number of different patterns
which arose from each sound individually, and from all vowel sounds,
are given in Table 9. As indicated in this table, most of the data
processing has been performed for the two parameter spaces formed
by (1) considering the location of spectral peaks alone, and (2)
considering spectral peaks and the first three spectrum moments: M_0,
M_1 and M_2. These two spaces will be called "peak space" and
"peak-moment space", respectively. Since 60 speech samples are extracted
each second, we see from this table that a total of approximately 28, 23,
and [illegible] seconds of spoken vowel sounds were processed to obtain the
data for the three speakers. Thus, an average of approximately 2.5
seconds of each sound was obtained for each speaker. Since each sound
was uttered 15 times by each speaker, an average of approximately 10
sample patterns were obtained from each utterance of a vowel sound.

Perhaps the first question which arises in considering such a
collection of data is whether enough samples have been obtained to
warrant acceptance of the distribution of sounds in parameter space
provided by the samples as an accurate estimate of the distribution which
would be observed if an unrestricted number of speech samples were
processed. In an attempt to obtain at least a partial answer to this
question it has been conjectured that the number of different speech
parameter patterns, N_p, which occur within an interval of speech, T,
tends to vary according to a parameter space usage curve:

$$ N_p = N_0 \left[ 1 - e^{-\alpha \Delta n} \right] $$

where N_0 is some number which is less than or equal to the total number
of patterns which can possibly occur, α is a constant reflecting the rate
at which the number of different patterns encountered grows as the length
of the speech observation interval is increased, Δ is the sampling interval,
and n is the number of samples obtained in the interval, T. The quantities
N_0 and α are determined by the parameter space and the speech source.

A thorough exploration of the parameter space usage curve for
the spaces formed by the speech parameters considered during this
study has not been possible. However, based on a few points obtained
for a single speaker, the two values N_0 = 400 and α = .05 provide a
reasonably good fit for the generation of spectral patterns alone, i.e.,
for the parameter space formed by positions of spectral peaks. Since
the total number of spectral peak patterns which can ever occur has been
found to be approximately 2^13, it appears that not much more than five
percent of the points in this parameter space would ever be used by vowel
sounds, no matter how long an interval of speech is considered. A further
indication provided by N_0 = 400 and α = .05 is that at least 75 percent of all
sample patterns produced by vowel sounds would be matched by the 308
peak patterns generated by the 1677 samples taken from speaker number
one (Table 9).
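The figures quoted above can be reproduced with a short calculation. The sketch assumes that the exponent of the usage curve is the product of α, the sampling interval of 1/60 second, and the number of samples n, so that it equals α times the observation time in seconds, and that the total number of peak patterns is 2^13 as listed in Table 5. The function name is hypothetical.

```python
import math


def expected_patterns(n, n0=400, alpha=0.05, delta=1 / 60):
    """Usage curve N_p = N_0 * (1 - exp(-alpha * delta * n)), exponent form assumed."""
    return n0 * (1 - math.exp(-alpha * delta * n))


n_samples = 1677                        # samples from speaker number one
seconds = n_samples / 60                # about 28 seconds of vowel sounds
per_sound = seconds / 11                # about 2.5 seconds per vowel sound
per_utterance = n_samples / (11 * 15)   # about 10 samples per utterance

predicted = expected_patterns(n_samples)   # roughly 301 distinct patterns
fraction_of_limit = predicted / 400        # roughly 0.75
fraction_of_space = 400 / 2 ** 13          # just under 5 percent
```

The predicted count of about 301 compares with the 308 patterns actually observed, and the ratio of about 0.75 is one way to arrive at the 75 percent figure.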

Although great reliance should not be placed on the parameter
space usage curve until further study is performed, it is encouraging
to note that 81 percent of the test samples processed for speaker number
one in the transcription experiments (described in Section 4) matched one
of the 308 patterns generated by the vowel sound data, a discrepancy
of six percent between observed and predicted relative frequency of matching.

In the interest of avoiding undue bulk, the complete histogram for
each sound has not been included in this report. Rather, a few salient
characteristics of the sound distributions are described. However, as
an indication of the type of distribution obtained, the complete
distribution of a single sound for a single speaker has been included in Table
10 for peak space, i.e., the parameter space formed by spectral peaks
alone.

TABLE 10. COMPLETE HISTOGRAM FOR THE SOUND EE (i) IN PEAK
SPACE (SPEAKER NUMBER ONE)

The high frequency of occurrence of some patterns within a
sound suggests that a large percentage of all speech samples might
correspond to a very small number of patterns of parameter values.
As an indication of both the region in parameter space occupied by
the vowel sounds, and the degree to which speech may be represented
by a small number of patterns, the ten most frequently occurring
patterns have been listed in Appendix II for each of the vowel sounds,
in peak space and peak-moment space for speaker number one. The
coverage of all samples obtained from the vowel data provided by
these 110 patterns is summarized in Table 11 for each of the three
speakers.

As a final indication of parameter space usage, the
distribution of speech sample spectra with respect to the number of spectral
peaks is shown in Figure 11. The graphs in this figure indicate that the
majority of speech samples (of vowel sounds) produce spectra with
three or four peaks. Further, the relative frequency of occurrence of
a given number of peaks appears to be approximately the same for the
three speakers.

In the course of this study, some thought has been directed to
the question of whether the parameter space usage can be reduced (without
increasing overlap between speech sounds) by some sort of "warping"
performed after patterns of parameter values are obtained. One technique
for attempting to accomplish this reduction has been investigated for
peak space. Specifically, it has been conjectured that the most important
characteristic of formant positions consists of the ratio of the second and
higher formant frequencies to the first formant frequency. If this is the
case, and if the vocoder filters are logarithmically spaced, then allowable
variations in formant positions representing a given speech sound would
consist of "rigid" shifts of the peak-picked spectra. To test this idea, a
program has been written which maps peak-picked spectra into a subset
determined in the following way. The first sample occurring in a
speech recording is installed as the initial member of an "intermediate
reference library". Each successive new spectrum arising in the speech
recording is compared with each of the spectra in this library. If a new
spectrum meets a criterion of "closeness" to any one of the members of
the library, then the new spectrum is associated with that reference library
member. If the criterion is not met for any of the library members,
then the new spectrum is installed as a new member of the library. A
criterion of closeness based on the notion that slight, rigid shifts in
formant positions are allowable has been tested using the Recomp II
computer. A description of the program is given in Appendix III. These
tests indicate that while the number of different library members tends to
level off at a few hundred as speech samples are processed, more work
must be done to relate the criterion of closeness to the speech sounds
themselves.

Fraction of Vowel Samples Covered by 10 Most Frequently Occurring
Patterns (Speaker No. 1, Speaker No. 2, Speaker No. 3)
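The intermediate reference library procedure described above can be sketched as follows. The closeness criterion used here, an exact match after a rigid shift of at most `max_shift` channels, is an assumption made for illustration; the criterion actually tested is the one described in Appendix III. Spectra are represented as sets of channel numbers (1 to 18) containing a peak.

```python
def shifted(pattern, s, channels=18):
    """Shift every peak by s channels; return None if a peak leaves the bank."""
    moved = {k + s for k in pattern}
    if any(k < 1 or k > channels for k in moved):
        return None
    return frozenset(moved)


def is_close(a, b, max_shift=1):
    """Assumed criterion: a equals b after a rigid shift of at most max_shift."""
    return any(shifted(a, s) == b for s in range(-max_shift, max_shift + 1))


def build_library(spectra, max_shift=1):
    """Install the first spectrum, then add a new member only when nothing is close."""
    library, assignment = [], []
    for spec in spectra:
        spec = frozenset(spec)
        for idx, ref in enumerate(library):
            if is_close(spec, ref, max_shift):
                assignment.append(idx)
                break
        else:
            library.append(spec)
            assignment.append(len(library) - 1)
    return library, assignment


# Illustrative spectra: the second is the first shifted up by one channel.
spectra = [{2, 8, 14}, {3, 9, 15}, {2, 9, 14}, {2, 8, 14}]
library, assignment = build_library(spectra)
```

Because each incoming spectrum is compared against every library member, the cost grows with the library size, so the observation that the library levels off at a few hundred members matters for practicality as well as for the conjecture.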


Overlap  Between  Speech  Sounds  in  Parameter  Space 

As discussed in Section 2.3, it is desired that a parameter space
be constructed in such a way that any given pattern of parameter values
always arises from the same sound. If this goal is achieved, then
parameter space may be partitioned into non-overlapping regions with
each region corresponding to exactly one speech sound. However,
as is the case with many other pattern recognition problems exhibiting
wide variations in manifestation of the classes involved, complete absence
of overlap between speech sounds (or any other language elements) will
very likely never be attained with any parameter space.

The degree to which a given set of parameters can be expected
to provide adequate separation of speech sounds can be estimated in
several ways. One of the more informative ways would be to calculate
the probability that any given speech sound will be designated as one of
the other speech sounds, when parameter space is partitioned in a way
which tends to minimize this quantity. As discussed in Section 2.3, many
methods exist by which such a partitioning can be approximated. If estimates
of the probabilities {P(v_k | S_i)} are available, then the maximum likelihood
method of partitioning can be applied with these estimates. We have applied
this method of partitioning parameter space, using the histograms for
each sound as estimates of {P(v_k | S_i)}. Letting v_k denote the k-th
distinct pattern of parameter values produced by samples of all speech
sounds (limited to vowels for the data reported here), the probability
that a single sample of speech corresponding to the i-th speech sound will
be associated with the j-th speech sound, α_ij, can be written (for i ≠ j)

$$ \alpha_{ij} = \frac{1}{n_i} \sum_{k=1}^{n} \delta_{ki} \, \Gamma_{ij}^{k} $$

where

δ_ki = the number of occurrences of the pattern v_k within
intervals of speech corresponding to the i-th speech sound,

Γ_ij^k = 1 if the pattern v_k is assigned to the j-th speech sound by the
partition, and 0 otherwise,

n_i = the number of samples arising from the i-th speech sound,

and

n = the total number of available speech samples.


Using the data collected for the eleven vowel sounds, the {α_ij} matrices
are given in Tables 12, 13, and 14 for three speakers, and for peak
space and peak-moment space. The diagonal elements in these matrices,
α_ii, indicate the estimated probability of correct classification. In
some cases, the most likely confusions occur between speech sounds
which are often interchanged by speakers. Many people, for instance,
use A instead of EH for the first vowel sound in the word "HELLO".
The result of interchanging these sounds would be a speech transcription
with a distorted accent. For non-vowel sounds, of course, interchanges
can produce more detrimental results.
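The construction of such an overlap matrix from per-sound histograms can be sketched as follows. The sketch assigns each pattern to the sound under which its relative frequency is greatest, which is the maximum likelihood rule with histograms as probability estimates. The tie-breaking rule (first sound in sorted order) and the data layout are assumptions, since the report does not state how ties were resolved.

```python
def confusion_matrix(hist):
    """hist maps sound -> {pattern: count}; returns alpha[i][j] as nested dicts."""
    sounds = sorted(hist)
    totals = {s: sum(hist[s].values()) for s in sounds}
    patterns = set().union(*(hist[s].keys() for s in sounds))

    # Maximum likelihood assignment of each pattern (ties go to the first sound).
    assigned = {
        p: max(sounds, key=lambda s: hist[s].get(p, 0) / totals[s])
        for p in patterns
    }

    alpha = {i: {j: 0.0 for j in sounds} for i in sounds}
    for i in sounds:
        for p, count in hist[i].items():
            alpha[i][assigned[p]] += count / totals[i]
    return alpha


# Illustrative counts only; pattern names "a", "b", "c" are placeholders.
hist = {"EE": {"a": 8, "b": 2}, "EH": {"b": 6, "c": 4}}
alpha = confusion_matrix(hist)
```

Each row of the result sums to one, the diagonal entry being the estimated probability of correct classification and the off-diagonal entries the estimated confusions. Because the same samples are used both to draw the partition and to score it, such estimates tend to be optimistic.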


As with parameter space usage, the adequacy of the number of available
samples may be questioned when considering such overlap matrices.
Although such devices as the parameter space usage curve introduced
earlier may be employed in an attempt to answer this question, time has
not allowed for this type of study during this project.

The pairwise overlaps indicated by the {α_ij} matrices do not
show the extent to which more than two speech sounds ever give rise
to the same pattern of parameter values. The number of different
patterns with p spectral peaks which ever occur within intervals of
speech corresponding to k sounds is indicated in Table 15 for each of
three speakers and for peak space. The entries in this table, coupled
with the {α_ij} matrices, suggest that although approximately 30 percent
of the different patterns in peak space occur, at different times, within
intervals of speech corresponding to different sounds, most occurrences
of these patterns are associated with a single sound.


Table 12. Estimated Relative Frequency of Correct and Misclassification
of Vowel Sounds for Speaker Number One, and Two Parameter
Spaces.

Table 13. Estimated Relative Frequency of Correct and Misclassification
of Vowel Sounds for Speaker Number Two, and Two Parameter
Spaces.

Table 14. Estimated Relative Frequency of Correct and Misclassification
of Vowel Sounds for Speaker Number Three, and Two Parameter
Spaces.

Table 15. Number of Different Patterns with p Peaks which

As a final indication of the degree to which the vowel sounds
produce different patterns in peak space, we have constructed bar graphs,
called spectral profiles, which show the percentage of sample patterns
observed for a given speech sound which possessed a peak in the k-th
channel, k = 1, 2, ..., 18. These profiles for the vowel sounds and the
three speakers are shown in Figures 12, 13, and 14. Although
information concerning combinations of peaks is not contained in these graphs,
in some cases the locations of formants can be inferred. In the sound
EE, for instance, the sum of samples containing a peak in either channel
1 or 2 accounts for essentially all samples; this is true for all three
speakers. This verifies the well-known fact that EE produces a first
formant in the frequency range 200 - 400 cps.
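A spectral profile of this kind can be computed directly from the sample patterns of one sound. In the sketch below each pattern is represented as a set of channel numbers containing a peak; this representation and the sample data are assumptions for illustration.

```python
def spectral_profile(patterns, channels=18):
    """Percentage of sample patterns with a peak in each channel 1..channels."""
    n = len(patterns)
    return [
        100.0 * sum(1 for p in patterns if k in p) / n
        for k in range(1, channels + 1)
    ]


# Illustrative patterns for one sound; not data from the report.
patterns = [{1, 10}, {2, 10}, {1, 11}, {2, 10}]
profile = spectral_profile(patterns)
low_channels = 100.0 * sum(1 for p in patterns if p & {1, 2}) / len(patterns)
```

The profile discards information about which peaks occurred together, so the combined figure for channels 1 and 2 is computed separately here rather than by adding two bars, which would double count any sample with peaks in both channels.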

One is easily tempted to draw conclusions from the spectral
profiles other than simply the approximate locations of formants as indicated
by the peak picker. In the sound EE, for instance, the relative frequency
of occurrence of a peak in the first channel may be used as an estimate
of the likelihood that an utterance of this sound will produce a peak in
that channel. Also, interpolation using the channel weightings indicated
in the spectral profiles might be expected to provide a more accurate
estimate of the formant locations for a given speaker. Further, in some
cases (for instance the second formant in the sound OO), the sum of all
weightings in a short frequency interval spanning no more than two or
three channels, and surrounded by channels with all-zero weightings,
may provide an indication of formant strength.

From the data obtained on vowel sounds, it can be concluded that
peak space provides a means of representation which achieves adequate
separation of all vowel sounds except those sounds which are perhaps
the most difficult for a human to distinguish between. In peak-moment
space, further, but not complete, separation is achieved. It is very
likely that the incomplete separation of vowel sounds in peak-moment
space is due solely to the way in which the parameters M_1 and
M_2 have been quantized. If these parameters are quantized properly
(as indicated in Figure 9), then essentially complete separation of
vowel sounds is expected, and further, the peak-moment parameter
space usage could very easily be less than that reported in Section 3.2.1
above.

FIGURE 13. Spectral Peak Profiles for Vowel Sounds, SPEAKER No. 2

4. SPEECH TRANSCRIPTION AND WORD RECOGNITION TECHNIQUES

In this section, techniques for completing the final transformation,
i.e., from parameter space to the space consisting of language elements, are
discussed. The basic language element used to represent speech in this study
is a speech "sound", as described in Section 2.1. In addition to methods of
transforming patterns of parameter values into these speech sounds,
techniques for recognizing spoken words as sequences of speech sounds are
discussed. Illustrative examples of transcription quality obtainable with the
simplest of these methods using a minimal parameter space (spectral peaks
alone) are included at the end of this section.

4.1 SPEECH TRANSCRIPTION METHODS

It has been suggested in Section 2.3 that, of all the different ways
that sample patterns of parameter values could be associated with language
elements, the maximum likelihood method (using histograms as estimates
of probability distributions) seems to offer the greatest potential for good
performance, if the number of speech samples used in constructing the
histograms is large enough. Assuming that enough samples can be obtained
(as is possible with the equipment described in Section 3.1), it might be
concluded that only one step remains to complete the process of automatic
transcription. This consists of implementing the table look-up operation
dictated by the decision boundaries resulting from this maximum likelihood
method of partitioning parameter space into nonoverlapping regions
corresponding to different speech sounds. If essentially no overlap occurs in
parameter space between speech sounds, this conclusion is correct. To produce
a sequence of sound symbols representing speech, it is only necessary that
each sample pattern of parameter values be compared with a collection of
labeled patterns (called a "reference library"), and that the label
corresponding to the reference pattern which is matched by the sample be typed.
Since most speech sounds (as defined on this project) span several speech
samples, the occasional occurrence of no match between a single incoming
sample pattern and any of the reference patterns will produce no significant
loss of information. One way of handling no-match decisions is to produce a
standard symbol, say "Y", indicating this fact; another option is to print out
nothing for the no-match decisions. An idealized transcription of the word
"THREE", using the latter option with the rudimentary exact match method
applied to each speech sample, would be:


In producing this transcription consisting of 15 sound symbols, perhaps 25
speech samples might have been processed, with 10 no-match decisions
within the word. Although it may be desired that some indication be
retained of the time intervals spanned by speech sounds (conceivably to
identify the speaker by recreating an accent), it is anticipated that the most
compact presentation would be desired for most applications. This can be
achieved by modifying the rudimentary exact match method so that
a sound symbol is printed only if it is different from the preceding symbol.
This modification produces "TH UR EE" for the above example. As with
no-match decisions, any one of several methods can be employed to indicate
samples taken during silence, or no-speech sounds, either indicating or not
indicating the duration of such intervals.
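The rudimentary exact match method and its compact modification can be sketched as a table look-up. The reference library contents, the pattern names, and the sound symbols used below are placeholders chosen for illustration; they are not the library or the symbols used in the experiments.

```python
def transcribe(samples, library, no_match=None, collapse=True):
    """Label each sample by exact look-up in the reference library.

    no_match=None prints nothing for unmatched samples; otherwise the given
    symbol is printed. collapse=True prints a symbol only when it differs
    from the preceding printed symbol.
    """
    out = []
    for pattern in samples:
        label = library.get(pattern, no_match)
        if label is None:
            continue
        if collapse and out and out[-1] == label:
            continue
        out.append(label)
    return " ".join(out)


# Hypothetical library and sample sequence ("x" matches no reference pattern).
library = {"p1": "TH", "p2": "UR", "p3": "EE"}
samples = ["p1", "p1", "x", "p2", "p2", "p2", "x", "p3", "p3"]
compact = transcribe(samples, library)
full = transcribe(samples, library, collapse=False)
marked = transcribe(samples, library, no_match="Y", collapse=False)
```

Note that in the compact form an unmatched sample falling inside a sound does not split it, because the comparison is made against the last symbol actually printed.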


If overlap exists between speech sounds in parameter space (i.e., if
it is likely that a sizeable percentage of speech samples will be misclassified),
then the rudimentary exact match method will produce a distorted transcription.
Two avenues exist by which such a situation could be improved: (1) additional
parameters can be extracted from speech signals, and (2) the way in which
decisions are reached can be changed. The first approach is straightforward.
Augmentation of peak space with spectral moments, for instance, produces
less overlap between vowel sounds, as indicated in the tables at the end of
Section 3. As soon as enough parameters are available to produce separated
speech sounds in parameter space, then the exact match transcription method
may be employed as described above.


If it happens that not enough parameters can be used to achieve
separation of speech sounds (without exceeding storage limitations, for
instance), then the method of associating patterns of parameter values with
speech sounds can be changed. The reason why there exists room for
improvement over the single sample exact match method based on maximum
likelihood is simply that several speech samples are often taken within
the intervals of speech corresponding to utterances of single speech sounds.
It is therefore not necessary to render a decision for each speech sample.
If some method is devised for segmenting speech into intervals corresponding
to utterances of speech sounds, then all of the speech samples taken within
each interval could be combined to produce a more reliable decision.


In any pursuit of this course for improving speech transcription
quality, several methods of combining speech samples deserve investigation.
Perhaps the most straightforward method consists of observing the sequence
of sounds occurring in an interval (as determined by the rudimentary exact
match method of labeling each speech sample), and associating the speech
segment with that sound which has occurred most frequently in the segment.
This "plurality rule" method would require a relatively simple augmentation
of the equipment required for the rudimentary exact match method alone.
Another technique consists of selecting one of the samples occurring within
the speech segment as a representative for that segment, and choosing the
sound with which the representative is associated by the maximum likelihood
method applied to single samples. Two ways to select such a representative
are (a) the segment midpoint, and (b) the sample occurring nearest to the
point in the segment at which the speech signal is judged (by some operation)
to be changing the least. The latter (quiescent) sample could be selected in
a variety of ways.
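
As an illustration only, the plurality rule and the midpoint representative can be sketched as follows. The sketch assumes that each segment arrives as a list of per-sample labels from the exact match method, with None standing for a no-match sample; the function names are hypothetical, and ties are broken by first occurrence, which the report does not specify.

```python
from collections import Counter
from typing import Optional, Sequence


def plurality_label(labels: Sequence[Optional[str]]) -> Optional[str]:
    """Sound occurring most often in the segment, ignoring no-match samples."""
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def midpoint_label(labels: Sequence[Optional[str]]) -> Optional[str]:
    """Label of the sample at the segment midpoint (method (a) in the text)."""
    if not labels:
        return None
    return labels[len(labels) // 2]
```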


When only one decision is to be rendered for each speech segment, it is
also possible to combine the sequence of speech samples spanned by the segment
to produce a single, derived parameter value which would represent the
combination. This kind of operation would produce what might be called a
derived parameter space consisting of a small number of dimensions. As an
illustration of this approach, consider the sequence of s spectral peak
patterns, $p_i = (p_{i1}, p_{i2}, \ldots, p_{i,18})$, $i = 1, 2, \ldots, s$,
corresponding to a given speech segment. These samples may be combined to
produce a single pattern, $u = (u_1, u_2, \ldots, u_{18})$, according to the
formula:

$$u_j = \frac{1}{s} \sum_{i=1}^{s} p_{ij}, \qquad j = 1, 2, \ldots, 18.$$



The quantity $u_j$ reflects the fraction of speech samples (within the given
segment) which have a peak in the j-th frequency channel. If a sound produces
mostly repetitions of the same peak pattern within a speech segment, then u
will be essentially identical with this peak pattern. If, on the other hand,
a speech sound is characterized by slight changes in peak patterns within a
speech segment, then u will consist of some components which are less than
one, but greater than zero. The amount and nature of the change in peak
positions within the speech segment will be reflected by the shape of the
pattern, u.
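
The combination of peak patterns into a derived pattern u amounts to a channel-by-channel average. A minimal sketch follows, assuming each peak pattern is a sequence of 0/1 values (1 meaning a peak in that channel); the function name is hypothetical, and four channels are used in the usage note instead of 18 for brevity.

```python
from typing import List, Sequence


def combine_peak_patterns(patterns: Sequence[Sequence[int]]) -> List[float]:
    """Average the peak patterns of one segment, channel by channel."""
    if not patterns:
        raise ValueError("a segment must contain at least one sample")
    channels = len(patterns[0])
    if any(len(p) != channels for p in patterns):
        raise ValueError("all patterns must have the same number of channels")
    s = len(patterns)
    return [sum(p[j] for p in patterns) / s for j in range(channels)]
```

For a segment of four samples with patterns (1,0,0,1), (1,0,1,0), (1,0,1,0), and (1,0,0,1), the result is (1.0, 0.0, 0.5, 0.5): the first channel always holds a peak, and a second peak alternates between the last two channels.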


For any given method of combining speech samples within a speech
segment, there exist many methods of associating the resulting pattern
with speech sounds. As with individual speech samples, the method of
maximum likelihood, using histograms as estimates of the distribution of
speech sounds in the derived parameter space, would probably provide the
most accurate association. However, it is quite possible that the large
number of different patterns of derived parameter values which can result
from utterances of speech sounds will preclude the collection of enough
samples of speech to warrant the use of this method. It is likely, therefore,
that one of the other methods described in Section 2.3 would have to be relied
upon. With the method of combining peak patterns described above, for
instance, the spectral profiles (Figures 12, 13, and 14) can be regarded as
representatives of the speech sounds, and correlation between u and a given
spectral profile would provide an indication of "closeness" between the speech
segment and the sound corresponding to the profile. The segment would be
associated with the speech sound whose corresponding spectral profile
produces the highest correlation with u.
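
The correlation decision can be sketched as below. The report does not state which correlation measure was intended, so this sketch assumes the ordinary (Pearson) correlation coefficient; the profile values in the usage note are invented for illustration and are not taken from Figures 12, 13, and 14.

```python
import math
from typing import Dict, Sequence


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has no variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def closest_sound(u: Sequence[float], profiles: Dict[str, Sequence[float]]) -> str:
    """Sound whose spectral profile correlates most highly with u."""
    return max(profiles, key=lambda sound: correlation(u, profiles[sound]))
```

With hypothetical four-channel profiles EE = (0.9, 0.1, 0.0, 0.8) and AH = (0.0, 0.9, 0.8, 0.1), the derived pattern u = (1.0, 0.0, 0.5, 0.5) correlates positively with EE and negatively with AH, so the segment would be labeled EE.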

In order to implement any of these methods for combining several
speech samples, a method of segmenting speech must be devised. From the
little study of the parameter AS (Table 3) which could be conducted after its
extraction was automated toward the end of this project, it appears that
speech signals can be partitioned into time intervals roughly corresponding
to speech sounds by thresholding this quantity. As illustrated in Table 4,
transitions between speech sounds can also be identified.
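
Segmentation by thresholding might be sketched as follows. This is only an assumed reading of the idea: it supposes that the thresholded quantity is large during transitions and small within a steady sound, so that a segment is a maximal run of consecutive samples below the threshold. The threshold value and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple


def segment_by_threshold(values: Sequence[float],
                         threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs below threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(values):
        if value < threshold:
            if start is None:
                start = i  # a new steady interval begins
        elif start is not None:
            segments.append((start, i))  # a transition closes the interval
            start = None
    if start is not None:
        segments.append((start, len(values)))
    return segments
```

Samples at or above the threshold are treated as transitions and belong to no segment.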

Before the efficacy of this or any other speech segmentation method
can be ascertained, many experiments must be carried out using several of
the more promising methods of combining speech samples. Time did not
allow for such experimentation during this project. However, the exact match
transcription method was programmed for simulation on the Recomp II
computer, and several tests have been conducted. As described above, the
rudimentary exact match method produces either a single phonetic symbol, or no
symbol if a sample does not match any of the patterns stored in the reference
library. Although tests were conducted for both peak space and peak-moment
space, the number of no-match decisions obtained for the latter space
precluded extensive study. The reasons for the large number of no-match
decisions with the spectral moment parameters are twofold. First, instead
of normalizing M1 and M2 with respect to M0, all three quantities were
extracted and quantized separately. The quantization of M0 thus produced
considerable unnecessary variation in the measure of spectral spread, σ.
Perhaps even more detrimental to proper extraction of moments, M2 was
quantized linearly, rather than logarithmically, thus producing high resolution
for unvoiced spectra, but very coarse resolution for voiced spectra. Both of
these problems were foreseen (and are easily remedied through the addition of
modified analogue-to-digital converters in the experimental speech processing
equipment), but could not be avoided within the time span of this project. The
transcriptions have therefore been conducted primarily for peak space only.

For this parameter space, the random arrangement of errors in transcriptions
using this method (with reference libraries constructed from data described
in Section 3*) produced relatively long sequences of symbols. Although the
correct speech sounds were represented more frequently than incorrect sounds,
considerable study is required to make the identification. However,
observation of the general persistence of the correct sounds over several
samples suggested another option for "smoothing" the sequence of sounds and
combining several samples to produce a reduced number of symbols. Instead of
typing out a single symbol for the most likely speech sound, if the two or
three most likely sounds associated with each speech sample are selected as
tentative candidates, and ambiguities are resolved in favor of sounds which
persist as candidates over the largest number of samples, then fairly readable
transcriptions are obtained. Specifically, the following procedure for
processing rudimentary exact match transcriptions (with up to three candidates
for each speech sample) has been followed:

(1) Print out a sound symbol only if the same sound is recognized
on two successive samples.

(2) Repeat a sound symbol for every successive adjacent pair of
occurrences of the sound.

(3) Ambiguities are resolved in favor of the sound which either
(a) has occurred on the previous sample, or (b) persists the
longest without interruption.

(4) Symbols for unvoiced sounds are inserted properly.
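
Rules (1) through (3) of the procedure above can be sketched in code; on the project the smoothing was carried out by hand, so this is one possible reading and not the procedure as actually executed. The sketch assumes each sample is given as a set of up to three candidate sounds, that "occurred on the previous sample" means the symbol printed for the preceding pair, and that remaining ties are broken alphabetically. Rule (4), the insertion of unvoiced sounds, is not modeled. The function names are hypothetical.

```python
from typing import List, Optional, Sequence, Set


def run_length(samples: Sequence[Set[str]], start: int, sound: str) -> int:
    """Number of consecutive samples from `start` in which `sound` is a candidate."""
    count = 0
    for candidates in samples[start:]:
        if sound not in candidates:
            break
        count += 1
    return count


def smooth(samples: Sequence[Set[str]]) -> List[str]:
    """Emit one symbol per adjacent pair of samples that share a candidate."""
    output: List[str] = []
    previous: Optional[str] = None
    for i in range(len(samples) - 1):
        common = samples[i] & samples[i + 1]
        if not common:
            previous = None  # rule (1): nothing recognized twice in succession
            continue
        if previous in common:
            choice = previous  # rule (3a): favor the sound just printed
        else:
            # rule (3b): favor the candidate that persists longest from here
            choice = max(sorted(common), key=lambda s: run_length(samples, i, s))
        output.append(choice)  # rule (2): one symbol per adjacent pair
        previous = choice
    return output
```

For the candidate sequence {EE}, {EE, I}, {EE, I}, {I}, {UR}, {AH}, the sketch prints EE EE I: the isolated UR and AH are dropped, and the ambiguity in the middle pair is settled in favor of the sound already printed.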

Examples of the resulting transcriptions obtained with this exact
match and smoothing method** are shown in Tables 16, 17, and 18, for three
different speakers. The rudimentary transcriptions were performed on the
Recomp computer, and the smoothing operations were completed by hand.


*For peak space, as indicated in Table 9, the libraries consisted of 308
patterns for Speaker Number One, 343 patterns for Speaker Number Two,
and 186 patterns for Speaker Number Three.

**From  utterances  of  a  test  word  list  (Table  19). 
TABLE 16. EXACT MATCH TRANSCRIPTION OF TEST WORD LIST FOR
SPEAKER NUMBER ONE (PEAK-SPACE)

STANDARD TRANSCRIPTION      AUTOMATIC TRANSCRIPTION

Z EE UR O                   Z I I UR O AH AH O
OO O AH N                   OO OO U U/O U/O N
T U OO OO OO                T I I OO OO
TH UR UR EE EE EE           TH OO UR/O EE EE EE EE EE EE
F O O O UR                  F O O O O O AW UR/O
F AH A EH V                 F UH UH UH UH UH UH UH OO V
S I I K S                   S I I K I
S EH V U N                  i I/UH/O V U/O N
EH EE T                     I I EE T
N AH EH I N                 N O/U UR EE EE N
P L UH UH S                 P L O AH UH S
M AH EH N U S               M U I I EH N U S
T AH EH I M S               f EH EH U I I M S
P UR I N T                  P UR I/EH N T
EE K OO U L S               EE EE EE g OO O O O O L S
S T AH UH P                 S T UH UH UH P
P AW I N T                  P O AW EE N T
S T AH UR T                 S T UH UR f
A AW L F U                  UH UH AW L F U UR
B EH EE T U                 B EH/I EH/I T EH/U/UH
EH EH K S                   AH/AW/UH EH K S
OO AH A EH I                OO/UR UH UH UR UH UH UH AW/EH/I
Z EE EE EE                  Z EE EE EE EE EE EE
UR I P EE EE T              UR UR UR P EE EE EE T
TH UR OO OO OO              TH OO U OO OO OO

Notes: (1) Interpolated sounds are underlined.
       (2) AH/EH indicates "either AH or EH".
TABLE 17. EXACT MATCH TRANSCRIPTION OF TEST WORD LIST FOR
SPEAKER NUMBER TWO (PEAK-SPACE)

STANDARD TRANSCRIPTION      AUTOMATIC TRANSCRIPTION

Z EE UR O                   Z EE U U U U U
OO O AH N                   UR UR AH N
T U OO OO OO                T U OO OO OO OO OO OO
TH UR UR EE EE EE           TH U U EE EE EE E
F O O O UR                  F O O AH O O
F AH A EH V                 F UH AH AW EH EH V
S I I K S                   J I I I I K S
S EH V U N                  S EH EH V U N
EH EE T                     I I EE J
N AH EH I N                 N UH/EH/A O EH EH N
P L UH UH S                 P L AH AH UH S
M AH EH N U S               M AH/UH N EH I S
T AH EH I M S               f A EH EH M I
P UR I N T                  f i i N 1
EE K OO U L S               EE EE K U U U U U L I
S T AH UH P                 I T AH AH AH P
P AW I N T                  P AW O I N T
S T AH UR T                 I T AW/O AW/O UR/EH/I f
A AW L F U                  AH AH L F O O
B EH EE T U                 B I/UR EE T I I
EH EH K S                   EH EH K S
OO AH A EH I                UR UR AH AH A A EH
Z EE EE EE                  Z EE EE EE EE EE EE
UR I P EE EE T              EE P EE EE EE T
TH UR OO OO OO              TH OO U U U OO

Notes: (1) Interpolated sounds are underlined.
       (2) AH/EH indicates "either AH or EH".
TABLE 18. EXACT MATCH TRANSCRIPTION OF TEST WORD LIST FOR
SPEAKER NUMBER THREE (PEAK-SPACE)

STANDARD TRANSCRIPTION      AUTOMATIC TRANSCRIPTION

Z EE UR O                   Z EH UR UR UR UR O
OO O AH N                   OO EH AW AW A/EH/I N
T U OO OO OO                T EE OO
TH UR UR EE EE EE           TH U U U U EE EE EE EE EE
F O O O UR                  O/AW O/AW O/AW O/AW AW AW
F AH A EH V                 F AW AW AW A A A A EH EH V
S I I K S                   1.1 1 I K S
S EH V U N                  S A A V EH EH EH N
EH EE T                     EH/I/UR EH/I/UR EE EE T
N AH EH I N                 N EH AW AW AW AW EH/I/UR N
P L UH UH S                 P L AW AW AW AW S
M AH EH N U S               M A A A EH EH N EH S
T AH EH I M S               T AW AW AW AW EH EH EH M f
P UR I N T                  P UR I I N T
EE K OO U L S               EE EE EE K O O O O L I
S T AH UH P                 S T AH/AW AH/AW AH/AW P
P AW I N T                  P AW/EH AW/EH UR UR N T
S T AH UR T                 i f EH EH A A A A T
A AW L F U                  A/EH A/EH AH/AW A/AW L F AW (
B EH EE T U                 B EH/I EH/I/UR EE T EH A A
EH EH K S                   EH EH EH/I/UR K
OO AH A EH I                OO AH A A AW AW A AW I I
Z EE EE EE                  Z EE EE EE EE
UR I P EE EE T              EE EE P EE EE EE T
TH UR OO OO OO              TH U U OO OO

Notes: (1) Interpolated sounds are underlined.
       (2) AH/EH indicates "either AH or EH".


The entire procedure can be instrumented quite easily.

Also shown in Tables 16, 17, and 18 is a "standard" transcription of
the test word list obtained by a human transcriber after listening to several
utterances of the words. It is clear that many different transcriptions would
be equally acceptable and certainly possible as a result of variations in
accent, as well as variations in interpretation by observers.

The sometimes perfect transcriptions obtained with this simple
exact match transcription method, using only peak patterns as the
extracted parameters, suggest strongly that the addition of parameters
reflecting spectral shape would produce highly readable transcriptions of
all vowel sounds, and most voiced sounds.

4.2 WORD RECOGNITION METHODS

Although automatic transcription of speech into sequences of
phonetic elements does not necessarily involve words as language elements
at all, the possibility of using a speech transcriber for voice control of
machines suggests that word recognition tests may afford a reasonable
method of evaluating speech transcription methods. Although word
recognition tests inherently involve not only the transcription methods, but
the word recognition methods as well, we have adopted this method, as was
suggested by the procuring agency.

To maximize the probability of correctly recognizing spoken words,
it is probably true that decisions on the presence or absence of words should
be based on intervals of observed speech which span the longest word in the
given vocabulary. Furthermore, to maximize the information obtainable
from an interval of observed speech for the purpose of deciding which (if
any) of a given list of words has been spoken, no intermediate decisions
should be made. From both of these standpoints, the recognition of phonetic
elements as a preliminary to word recognition tends to degrade slightly the
potential for achieving accurate word recognition for a given vocabulary.
However, as pointed out previously, any attempt to utilize words as the
basic language elements for transforming speech into readable text creates
intolerable restrictions on the allowable speech which can be transformed,
involves basic difficulties in changing vocabulary, and requires that initial
decisions be rendered between a far larger number of alternatives, thus
increasing equipment complexity significantly. Therefore, we must be
content with achieving whatever performance is attainable through the use
of sequences of sounds as the starting point for word recognition.
Two approaches have been considered for processing sequences of
sounds to recognize words. With the first approach, a number is assigned
to each sound in such a way that sequences of sounds corresponding to
different words should be maximally differentiable from each other by the
decision rule with which words are to be recognized. If, for instance, the
word recognition method consists of correlating sequences of sounds with
stored sequences, each of which represents a word, then such an
assignment of numbers to sounds can have a relatively simple solution.
Specifically, if all of the words to be recognized are so different as to
produce uncorrelated sequences of sounds (if transcribed perfectly), then
numbers should be assigned to sounds so that the variance of numbers
corresponding to first sounds of all words in the vocabulary is maximized.
Similarly, the variance of numbers associated with subsequent sounds in a
perfect transcription should also be maximized.

The second approach to the assignment of numerical values to sounds
is based on engineering considerations aimed at making the electronic
implementation of word recognition particularly simple. Assume that sounds
occurring in a specific word are assigned numerical values in agreement
with the chronological sequence in which these sounds occur in the word.
For instance, in the word "art", transcribed "AH UR T", if we assign numbers
to the 3 different sounds so that AH = 1, UR = 2, and T = 3, then a rudimentary
exact match transcription of the word "art" might appear as

AH  AH  AH  AH  UR  UR  UR  T  T
1   1   1   1   2   2   2   3  3

When associated with other words, the sounds AH, UR, and T may be assigned
different numerical values so that, in the particular word in question, numbers
assigned to sounds form a monotonically increasing sequence. This assignment
is readily implemented by assigning to each sound recognition output
unit (flip-flop) a voltage divider, where each tap on the divider is routed to
a different word recognition unit. Numerical values of voltages appearing at
the taps correspond to the position of that sound in the sequence of sounds in
the word to which the output of the tap is routed. Hence, as shown in Figure
15, the machine implemented by the above description consists of a number of
different parts.

The machine will be described by reference to a specific example,
wherein the recognition of only 2 words, the word "art" and the word "tar",
is required. Ideal transcriptions of these two words contain three basic
sounds. With the recognition of each there is associated a flip-flop, labeled
FF(AH), FF(UR), FF(T). As a result of each speech sample, usually only
one of the sound recognition flip-flops will be ON. It thus generates a pair
of voltages at the two taps of the voltage dividers labeled AH1 and AH2,
UR1 and UR2, or T1 and T2, depending on which flip-flop drives the attenuator.
The subscripts indicate to which word recognition device (one or two,
corresponding to the words "art" and "tar") the divider outputs are routed. The
numerical values of the coefficients signify the position of the corresponding
sound in the word whose number is denoted by the subscript of the coefficient.
Coefficients with like subscripts are added, resulting in the occurrence of a
monotonic sequence of voltages at the output of that adder which corresponds
to the word presently uttered. Since sound sequences corresponding to
different words will not be identical for good transcriptions, only one of the
summing devices will have a monotonic voltage output as a function of time.
This will occur in the particular device which corresponds to the word being
spoken.

The differentiator that follows each summing device will have an
output which consists of a sequence of positive impulses if the right sequence
of sounds, corresponding to the word of present interest, is uttered.
Multiple successive occurrences of identical sounds will result in a
differentiator output that still consists only of positive impulses, except
that impulses will be missing at times corresponding to the multiple
occurrence of identical sounds. The occurrence of negative impulses in any of
the differentiator outputs indicates that the word corresponding to the
particular summer-differentiator probably did not occur. Recognition of a word
should be based on a comparison of the numerical values of the outputs of a
set of low-pass filters that follow the differentiators. The output of each
low-pass filter is proportional to the number of positive impulses minus the
number of negative impulses that occurred at the output of the differentiator
over a period of time equal to the word length. Thus the word which resulted
in the least number of errors in the expected sound sequence is said to have
been spoken.
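
The summer, differentiator, and low-pass filter chain can be imitated numerically. The sketch below is an illustration under stated assumptions, not a description of the circuit in Figure 15: each sound is assumed to occur once per word, a sound that does not belong to a word contributes zero volts to that word's unit, and the filter output is taken as the count of positive steps minus the count of negative steps. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def position_code(word_sounds: Sequence[str]) -> Dict[str, int]:
    """Tap voltages for one word unit: each sound maps to its 1-based position."""
    return {sound: k for k, sound in enumerate(word_sounds, start=1)}


def unit_score(transcription: Sequence[str], word_sounds: Sequence[str]) -> int:
    """Positive steps minus negative steps in the summed voltage sequence."""
    code = position_code(word_sounds)
    levels: List[int] = [code.get(sound, 0) for sound in transcription]
    score = 0
    for earlier, later in zip(levels, levels[1:]):
        if later > earlier:
            score += 1  # positive impulse from the differentiator
        elif later < earlier:
            score -= 1  # negative impulse: sounds out of order for this word
    return score


def recognize(transcription: Sequence[str],
              vocabulary: Dict[str, Sequence[str]]) -> str:
    """Word whose unit gives the largest score."""
    return max(vocabulary, key=lambda w: unit_score(transcription, vocabulary[w]))
```

For the transcription AH AH AH AH UR UR UR T T, the "art" unit sees the levels 1 1 1 1 2 2 2 3 3 and scores 2, while the "tar" unit sees 2 2 2 2 3 3 3 1 1 and scores 0, so "art" is selected.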

Since a thorough evaluation of either of these approaches to word
recognition can be conducted only after a parameter space which separates
essentially all speech sounds has been constructed, tests have been confined
during this project to the easily simulated engineering approach. A test
word list consisting of 25 words has been selected, and exact match
transcriptions have been used as inputs to 25 word recognition units designed
in accordance with the illustration in Figure 15.
Several considerations have entered into the selection of a test
vocabulary. First, for a specified vocabulary size, word recognition tends
to be easier to perform if the length and variation in length of words are
large. Therefore, to ensure that a high level of difficulty is established for
testing word recognition and transcription schemes, short test words should
be selected for the test vocabulary.

For a given vocabulary size, word recognition tends to be less
difficult if many different sounds are involved in the words. On the other
hand, sound recognition may become more difficult as the number of allowable
speech sounds is increased in a vocabulary. Since the test vocabulary is to
be used as a means of evaluating both speech transcription and word
recognition methods, we have chosen not to limit the sounds in a transcription
to those involved in the test word vocabulary. Therefore, sound recognition
capability is made independent of the test word vocabulary, and will not be
affected by the distribution of sounds in these words. At the same time, we
have chosen to select a vocabulary such that each word is not only short but
tends to sound like a few of the other words in the vocabulary, so that word
recognition, even by a human, may be a significantly difficult problem. The
Test Word List (TWL) appears in Table 19.


TABLE 19. TEST WORD LIST

ZERO      NINE      START
ONE       PLUS      ALPHA
TWO       MINUS     BETA
THREE     TIMES     X
FOUR      PRINT     Y
FIVE      EQUALS    Z
SIX       STOP      REPEAT
SEVEN     POINT     THROUGH
EIGHT

This vocabulary was selected to illustrate typical commands and data for use
in a computer-input application, as well as to satisfy the qualitative
desiderata discussed above.


The ten spoken numerals have been used in the past* as test words.
The other 15 words in the Test Word List include several word groups with
common voiced and unvoiced sounds. The relative frequencies of occurrence
of sounds in the TWL are shown in Table 20, along with the relative frequency
of occurrence of sounds in conversational speech.* Although no major effort
was made to match the distribution of sounds in the TWL precisely with their
distribution in conversational speech, a close correspondence was obtained
for all but six of the sounds. The 23 sounds involved in the TWL account for
approximately 87 percent of sound occurrences in conversational speech.

*For instance, [7], [11].


To obtain a statistically significant indication of word recognition
performance, five different utterances of each word in the test word list were
transcribed with the rudimentary exact match method (interpolating non-vowel
sounds), and each of the resulting 125 sequences of sounds was processed
through each of 25 word recognition units. The transcriptions were performed
by computer simulation, and the word recognition units were simulated by hand
calculations. Approximately 80 percent correct identification of words was
obtained, using only peak patterns as the extracted parameters. While no
tests were possible using spectral moments as well as spectral peaks, it is
anticipated that these simple exact match transcription and word recognition
methods will produce greater than 90 percent correct identification of words,
taking into account unvoiced, as well as voiced sounds.


*As derived from Table 15 in [4], p. 96.
TABLE 20. RELATIVE FREQUENCY OF OCCURRENCE OF SOUNDS


5. CONCLUSIONS AND RECOMMENDATIONS


The basic approach to speech transcription investigated on this
project consists of (1) representing speech signals as sequences of
periodic sample patterns of parameter values, called "instantaneous
spectra", and (2) associating phonetic language elements with
selected sets of patterns. To ascertain storage requirements and
obtain estimates of the accuracy with which speech sounds can be
represented, laboratory speech processing equipment (Figure 5) has
been utilized to obtain data on several parameters (Table 9), and the
representation of speech sounds in the parameter spaces constructed
from two combinations of these parameters has been investigated.
Methods of associating patterns of parameter values with speech sounds,
and sequences of speech sounds with words, have also been examined.
Although these methods were selected primarily on the basis of ease
of instrumentation, they exhibit high potential for providing accurate
transcriptions and word recognition. Salient conclusions and
recommendations for further development of these methods are presented
in the following paragraphs.


With respect to accuracy, Tables 12, 13, and 14 indicate that
parameter spaces constructed from spectral peaks and a few other
parameters reflecting spectral shape of speech samples can be expected
to provide good separation of vowels and other voiced sounds.
Specifically, the average estimated probability of correctly identifying
the vowel sound from which a single 17 msec speech sample is taken
is approximately 0.74, using spectral peaks alone (peak space).
Augmentation of peak space with the first two spectrum moments
increases the estimated probability of correctly identifying a single
vowel sample to 0.86. If the "plurality rule" method (Section 4.1) of
combining speech samples within segments corresponding to single
speech sounds is used to reduce the number of decisions rendered per
unit time, then these individual sample probabilities could be expected
to produce a probability of correct decision (for vowels) of 0.90 and 0.98,
for peak space and peak-moment space, respectively.*


*These figures are based on the assumption that an average of five
speech samples occur within a speech segment.
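
The report does not say how the segment-level probabilities were computed. As a rough check only, one can assume that the five samples in a segment are classified independently and ask how often a strict majority (three or more of five) is correct; a correct strict majority is sufficient, though not necessary, for the plurality rule to succeed, so the result is a lower bound. The function name is hypothetical.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(needed, n + 1))
```

Under these assumptions, single-sample probabilities of 0.74 and 0.86 give approximately 0.89 and 0.98, which is in line with the figures of 0.90 and 0.98 quoted above.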
Recognition of 25 or more words by processing sequences of transcribed
speech sounds can be performed with relatively simple equipment (Figure
15). The accuracy attainable is expected to be quite high when several
additional parameters are measured in conjunction with spectral peaks.
Using spectral peaks alone with the rudimentary transcription method for
vowel sounds and interpolating non-vowel sounds, an 80 percent probability
of correct recognition of one of 25 words has been obtained with the most
easily instrumented word recognition methods.


The number of different patterns of parameter values which can occur
in speech within an interval corresponding to a single decision serves as
an indication of the efficiency with which speech signals are being
processed, as well as the complexity of equipment required to render the
decision automatically. With the rudimentary exact match method of
associating a speech sound with each speech sample, the number of different
patterns of parameter values is quite small. Using only spectral peaks in
an 18 channel vocoder, for instance, there are fewer than 7000 different
patterns which are possible. This would indicate that fewer than 13 bits
of information are utilized for each decision. Moreover, taking into
account the fact that not all possible patterns of parameter values are
produced by speech signals, the information processed for each decision
is even less. With spectral peaks, for instance, it is estimated (Section
3.1) that no more than approximately 400 different spectral peak patterns
would ever occur in vowel sounds; i.e., only 9 bits per decision would be
required for vowel sounds. With the addition of other speech parameters
the information storage requirements would increase, but evidently not
drastically. With the addition of the first two spectral moments (properly
quantized as indicated in Figure 9), it appears that three additional bits
would suffice.
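
The pattern counts quoted above can be checked with a short calculation. The sketch below rests on an assumption that is not stated in this section: that a peak pattern is an 18-position binary pattern in which two adjacent channels cannot both hold a peak (a spectral peak being a local maximum). Under that assumption the count is 6765, consistent with "fewer than 7000". The function names are hypothetical.

```python
from math import ceil, log2


def count_peak_patterns(channels: int) -> int:
    """Count binary patterns of the given length with no two adjacent ones."""
    end_without_peak, end_with_peak = 1, 1  # patterns of length 1: "0" and "1"
    for _ in range(channels - 1):
        # a 0 may follow anything; a 1 may follow only a 0
        end_without_peak, end_with_peak = (end_without_peak + end_with_peak,
                                           end_without_peak)
    return end_without_peak + end_with_peak


def bits_needed(pattern_count: int) -> int:
    """Smallest whole number of bits that can index pattern_count patterns."""
    return ceil(log2(pattern_count))
```

The same helper gives 13 bits for 6765 patterns and 9 bits for the estimated 400 vowel patterns; three further bits for the spectral moments would bring the vowel figure to 12.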

Implementation 


From the standpoint of implementing an exact match transcription
method, a reference library consisting of 1000 patterns can be handled
quite easily. The exploitation of either "always" or "never" conditions
for most of the binary quantities involved in patterns of parameter values
produces a decision "tree" with only a few nodes and branches. This
transcription method can be implemented readily with diode matrices or
relays.
Word recognition units can be constructed readily by the method
indicated in Figure 15. It should be stressed that by first transcribing
speech into sequences of speech sounds, essentially all restrictions on
the number and type of different words which can be handled are lifted.
Of course, performance will tend to be degraded as the number of words
it is desired to distinguish between increases, but the construction of
word recognition units can proceed independently of the transcription
method being employed.

Recommendations

The data collection and analysis program reported here primarily
for voiced sounds should be carried out for the remaining speech sounds.
This would produce a complete indication of the transcription accuracy
attainable with spectral peaks and spectrum moments.


Two courses for improving transcription accuracy, augmentation
of parameters and modification of recognition methods (discussed in
Section 4.1), should be pursued in the following way. First, additional
speech parameters should be introduced to produce a parameter space
in which all speech sounds are widely separated. In addition to
normalization of the spectral moments, the following parameters deserve
examination:

(1)  Derivative of Normalized Speech Envelope

(2)  Silence  Indication 

(3)  Low  Frequency  First  and  Second  Moments 

(4)  High  Frequency  First  and  Second  Moments 

(5)  Duration  of  Unvoiced  Intervals 

(6)  Formant  Time  Derivative  Polarity 

With the addition of some of these parameters, the rudimentary exact
match transcription method, with smoothing (see Section 4.1), should produce
acceptable transcriptions for the majority of speech sounds.

To attain a readable transcription for all members of a phonetic
alphabet, it may be necessary to introduce another method of recognition.
Through the use of the parameter, AS (Section 3.1.1), speech may be
segmented into short intervals corresponding to either utterances of
speech sounds, or portions of speech sounds. By combining all of the
patterns of parameter values occurring in a given segment, a more
reliable decision can be rendered. As suggested in Section 4.1, several
methods of combining the samples occurring within a segment should
be investigated thoroughly, including correlation of cumulative
spectral peak counts with replicas of spectral profiles, and plurality
rule of sounds within each segment.
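
The plurality rule mentioned above can be sketched as follows, assuming
each sample in a segment has already been assigned a speech sound label
by the exact match method. The function name, the sample labels, and the
tie-breaking choice (the earliest label in the segment wins) are
assumptions made for illustration.

```python
from collections import Counter


def plurality_label(labels):
    """Return the sound label occurring most often among a segment's samples."""
    if not labels:
        raise ValueError("segment contains no samples")
    counts = Counter(labels)
    best = max(counts.values())
    # Assumed tie-break: the label that appears first in the segment wins.
    for label in labels:
        if counts[label] == best:
            return label


# Hypothetical per-sample decisions within one segment.
segment = ["i", "i", "e", "i", "a"]
print(plurality_label(segment))
```

A single decision per segment discards isolated sample errors, which is
the sense in which combining samples is expected to be more reliable.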
APPENDIX I


Program  for  Simulating  a  Peak-Picking  Formant  Tracking  Vocoder 


The input to the program consists of a sequence of "instantaneous spectra"
(samples of a vocoder output taken every A seconds) representing an
isolated spoken word. The number of spectra in each utterance depends
upon the duration of the word. Each spectrum is in 18 channel vocoder
format with the energy in each channel quantized in 3 bits, and is
represented by the quantities $a_1, \ldots, a_{18}$. The object of the program is to
locate for each spectrum the frequency channels in which the energy exhibits
a local maximum. The output for each spectrum consists of 18 bits, one
for each vocoder channel, where a "one" indicates a peak and a "zero",
no peak, in the corresponding channel. In addition, one bit for the
voiced-unvoiced decision and three bits for the number of peaks are included. A flow
chart for this program is shown in Figure 16.
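
The output record described above totals 22 bits: 18 peak bits, one
voicing bit, and a three-bit peak count. A minimal sketch of packing
such a record into one integer follows. The bit ordering (channel 1 in
the most significant position, then the voicing bit, then the count) is
an assumption, since the report does not state the layout; the function
names are likewise hypothetical.

```python
def pack_record(peak_bits, voiced):
    """Pack 18 peak bits, a voicing bit, and a 3-bit peak count into one integer."""
    if len(peak_bits) != 18:
        raise ValueError("expected one bit per vocoder channel (18)")
    count = sum(peak_bits)
    if count > 7:
        raise ValueError("peak count does not fit in three bits")
    word = 0
    for bit in peak_bits:               # channel 1 ends up most significant
        word = (word << 1) | bit
    word = (word << 1) | int(voiced)    # assumed position of the voicing bit
    word = (word << 3) | count          # assumed position of the peak count
    return word


def unpack_record(word):
    """Inverse of pack_record under the same assumed layout."""
    count = word & 0b111
    voiced = bool((word >> 3) & 1)
    peak_bits = [(word >> (21 - k)) & 1 for k in range(18)]
    return peak_bits, voiced, count


peaks = [0] * 18
for channel in (4, 9, 14, 18):          # hypothetical peak channels, 1-based
    peaks[channel - 1] = 1
record = pack_record(peaks, voiced=True)
print(format(record, "022b"))
```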

The  method  of  locating  the  local  peaks  may  be  described  briefly  as  follows: 

There is a peak in channel $n$ if $a_n > a_{n+1}$ and $a_n > a_{n-1}$; $a_0$ and
$a_{19}$ are assumed equal to 0, to allow peaks at the ends. If there are
several channels of equal magnitude surrounded by channels of smaller
magnitude, there are two alternatives. If the number of equal channels is
odd, the peak is placed in the middle channel. If the number is even, the
middle lies between two channels. In this case the peak is placed on the
side of the middle which has the larger surrounding channel, or if the two
surrounding channels are equal the peak is placed arbitrarily on the low
frequency side.
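
The peak-locating rule can be sketched in a few lines. This is an
illustrative reconstruction from the description above, not the original
program; the function name and the sample spectrum are assumptions.
Channel energies are padded with zeros at both ends, and each run of
equal values is treated as one candidate peak.

```python
def pick_peaks(energies):
    """Return one bit per channel: 1 where the spectrum has a local peak."""
    n = len(energies)
    padded = [0] + list(energies) + [0]     # zero channels beyond both ends
    bits = [0] * n
    i = 1
    while i <= n:
        j = i                               # extend over a run of equal values
        while j + 1 <= n and padded[j + 1] == padded[i]:
            j += 1
        left, right = padded[i - 1], padded[j + 1]
        if padded[i] > left and padded[i] > right:
            length = j - i + 1
            if length % 2 == 1:
                peak = i + length // 2      # odd run: middle channel
            else:
                low_mid = i + length // 2 - 1
                # even run: side with the larger neighbour, low side on a tie
                peak = low_mid + 1 if right > left else low_mid
            bits[peak - 1] = 1
        i = j + 1
    return bits


# Hypothetical 18 channel spectrum, 3 bits (0 to 7) per channel.
spectrum = [0, 2, 5, 7, 4, 3, 3, 6, 6, 4, 1, 1, 3, 3, 3, 1, 0, 4]
peak_channels = [k + 1 for k, b in enumerate(pick_peaks(spectrum)) if b]
print(peak_channels)
```

In this example the even run in channels 8 and 9 is resolved toward
channel 9 because its upper neighbour is larger, and the odd run in
channels 13 to 15 yields a peak in channel 14.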

A result of the peak-picking operation is shown in Figure 17.

The  voiced-unvoiced  decision  is  made  using  a  linear  discriminant. 
APPENDIX II


Ten Most Frequently Occurring Patterns in Peak
Space and Peak-Moment Space, for Each Vowel
Sound and a Single Speaker


A. Peak Space


The  ten  most  frequently  occurring  patterns  of  values  of  the 
local  spectral  peaks  are  listed  below  for  each  of  the  eleven  vowel  sounds 
listed  in  Table  1. 
[Table of peak patterns not recovered; only the heading fragment "Spectral Peaks Relative" survives.]
B. Peak-Moment Space

The ten most frequently occurring patterns of values of the local
spectral peaks and the first three spectral moments are listed below
for each of the eleven vowel sounds listed in Table 1.
APPENDIX III

A Program for Mapping Peak-Picked Spectra
into a Reduced Space


The purpose of this program is the simultaneous generation of an
"intermediate reference library" of "instantaneous spectra" in the 18
bit peak-picked format, and recording of speech data as a sequence of
intermediate reference library numbers.

This program, designated as SMREF, sets up a library of reference
patterns for speech data based on the following rules for similarity
of two input spectra:

1.  Voiced-unvoiced designation must be the same.

2.  The number of peaks must be the same.

3.  If the number of peaks is zero or one, the spectra must
be identical.

4.  If the number of peaks is greater than one but less than seven,
corresponding peaks of one spectrum must not be more than one channel
away from those of the other and the direction of the shift in peak locations
must be the same.
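
The four similarity rules can be expressed as a single predicate. The
sketch below is an interpretation, not the SMREF code: a spectrum is
represented as a (voiced, peaks) pair with peaks given as a tuple of
ascending channel numbers, and rule 4 is assumed to permit unshifted
peaks alongside shifted ones provided no two peaks move in opposite
directions. The function name is hypothetical.

```python
def is_similar(a, b):
    """Apply the four similarity rules to two (voiced, peaks) spectra."""
    voiced_a, peaks_a = a
    voiced_b, peaks_b = b
    if voiced_a != voiced_b:                 # rule 1: same voicing
        return False
    if len(peaks_a) != len(peaks_b):         # rule 2: same number of peaks
        return False
    if len(peaks_a) <= 1:                    # rule 3: zero or one peak, identical
        return peaks_a == peaks_b
    shifts = [q - p for p, q in zip(peaks_a, peaks_b)]
    if any(abs(s) > 1 for s in shifts):      # rule 4: at most one channel away
        return False
    # Rule 4, assumed reading: upward and downward shifts may not be mixed.
    return not (1 in shifts and -1 in shifts)


print(is_similar((True, (3, 9, 14)), (True, (4, 9, 15))))   # shifts +1, 0, +1
print(is_similar((True, (3, 9, 14)), (True, (4, 9, 13))))   # shifts +1, 0, -1
```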

Input is a series of tapes; the first record in each section indicates
the number of vectors to follow, where each vector is a one word record
describing the peak patterns, i.e., the location of the peaks, the voicing
indication and the number of peaks. The input is compared against all
previously established reference patterns. If a match is found, the
"matching count" for the reference is updated. If no match is found, the input
is stored as a new reference pattern. For each input the number of the
matching reference spectrum is typed.
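
The library generation loop described above may be sketched as follows.
This is a simplified illustration rather than the original program: the
function name is hypothetical, the similarity test is passed in as a
parameter, a new reference is assumed to start with a matching count of
one, and library numbers are assumed to start at zero.

```python
def build_library(inputs, similar):
    """Build a reference library and recode inputs as library numbers."""
    references = []   # stored reference patterns, in order of first appearance
    counts = []       # "matching count" for each reference
    sequence = []     # library number reported for each input
    for spectrum in inputs:
        for number, reference in enumerate(references):
            if similar(reference, spectrum):
                counts[number] += 1
                break
        else:
            # No match: store the input as a new reference pattern.
            references.append(spectrum)
            counts.append(1)              # assumed initial count
            number = len(references) - 1
        sequence.append(number)
    return references, counts, sequence


# Hypothetical inputs, compared here by exact equality for brevity.
data = ["p1", "p2", "p1", "p3", "p2", "p1"]
refs, counts, seq = build_library(data, lambda r, s: r == s)
print(refs, counts, seq)
```

Because an input is compared against references in the order they were
stored, the first similar reference wins, so the resulting library can
depend on the order of the input data.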
After all input spectra have been examined, the library is sorted and a
three-section tape is punched. The first section contains the unvoiced patterns
arranged according to number of peaks followed by the "count" of unvoiced
patterns. The second consists of the same data for the voiced sounds.
Section three is the unsorted reference library and all necessary controls
for continuing the library generation at a later date. A copy of the sorted
libraries is also typed.

There  are  the  following  restrictions: 

1.  Program  is  designed  for  eighteen  channel  data  with  a 
maximum  of  six  peaks. 

2.  There are approximately 3,000 (decimal) locations reserved
for the reference pattern library. If an extraordinary number of input
vectors is used, there is a possibility of exceeding this space. (Loc. 0045.1
indicates the storage location for the next reference
pattern. This should not exceed 6777.0.)

The program has been written for the Recomp II Computer for
Contract AF30(602)-2641, February, 1962.

Flow charts for this program are shown in Figures 18, 19, and 20.
Figure 19.  Flow Chart for Matching (Subroutine for SMREF)


Output  Section  for  SMREF 